Preface



After Pisa in 1997 and Heraklion in 1998, ECDL will take place in Paris at the
prestigious location of the Bibliothèque Nationale de France. It is the third in
a series of European conferences on research and technology for digital libraries,
partially funded by the European Commission’s TMR Programme. Its main
objective is to bring together researchers from multiple disciplines to present 
their work on enabling technologies for digital libraries. The conference also 
provides an opportunity for scientists to develop a research community in Europe 
focusing on digital library development. 

The program committee selected 26 papers from 124 submissions. Besides
these, two papers were invited. One is by Jean-François Abramatic, the Chairman
of the WWW Consortium, and the second by Robert Wilensky, Professor
at U.C. Berkeley.



Challenges for the Web:
Universality and Scalability 



Jean-François Abramatic



INRIA Rocquencourt, 

BP 105, 78153 Le Chesnay, France 
http://www.inria.fr/



Abstract. The Web is becoming the universal information space that 
was envisioned by its inventor, Tim Berners-Lee. To reach its full po- 
tential, the Web needs to face two major challenges: Universality and 
Scalability. Universality means that anybody should be able to access 
and publish information on the Web. Therefore, the Web should take 
into account the vast differences in culture, education, ability, material 
resources, and physical limitations of users on all continents. Scalability 
means that while millions of services are deployed on the Web, the infras- 
tructure should be able to ensure that performance, trust and relevance 
keep developing. 

The talk will present achievements and work in progress at the World 
Wide Web Consortium (W3C) that address the challenges facing the 
Web. 
The UC Berkeley Digital Library Project:
Re-thinking Scholarly Information 
Dissemination and Use 



Robert Wilensky 

University of California, Berkeley, 
721 Soda Hall, Berkeley, CA 94720 
http://www.cs.berkeley.edu/



Abstract. Information technology is not merely providing enhanced versions
of services of the sort we have come to expect from libraries; it is
inducing a fundamental change in the way information is created, dis- 
seminated, and used. The shift from the current centralized, discrete 
publishing model, toward a distributed, continuous, and self-publishing 
model, is already underway. However, if the shift is left to its own devices, some of
the better aspects of the current model, such as peer review, may
be compromised, even as the opportunity for new services is afforded.
Effort will also be required to provide first class support in the emerging 
infrastructure for data that are not textual in nature, such as images, 
videos, maps, and scientific data sets. 

Many tools and technologies will be useful in enhancing and exploiting 
this view of the emerging information infrastructure. One set of tools 
relates to document technologies. "Multivalent Documents" is a new
model of documents that seems useful in this context. The multivalent
document model is (i) highly open, meaning that it supports an open-ended
variety of document formats and functions, (ii) highly extensible,
meaning that it can be extended and customized in novel ways and to
meet particular user needs, and (iii) highly distributed, meaning that
components of a document may exist as separate networked resources,
which are combined dynamically into a coherent document. A particularly
attractive aspect of the model is the manner in which it supports
"spontaneous collaboration", the ability of a user to annotate web pages,
scanned images, and other networked resources to which that user has
no privileged relation.

Multivalent documents address some issues in manipulating on-line re- 
sources. Finding those resources is still problematic, especially for those 
in image form. "Automatic content analysis" is the set of techniques for
analyzing the content of information objects so as to facilitate their sub- 
sequent access. We present some recent developments in this area for 
accessing document images, photographs, and text. 
Image and Metadata Distribution at Seven University
Campuses: Reports from a Study of the Museum 
Educational Site Licensing Project 



Howard Besser¹ and Rosalie Lack²

¹ University of California, Los Angeles, School of Education & Information, GSEIS
Building, Los Angeles, CA 90095-1520, USA
howard@sims.berkeley.edu

² California Digital Library, Office of the President, 300 Lakeside Drive, Oakland, CA
94612-3550, USA
Rosalie.Lack@ucop.edu



Abstract. This paper summarizes the major findings of a University of 
California study of the Museum Educational Site Licensing Project (MESL) — 
the first large-scale multi-institutional image and metadata distribution 
experiment in the US. The study examined the costs and social impacts of 
distributing a large body of digital images and metadata from a set of different 
museums to universities. Among the findings are that the digital distribution 
environment, as a whole, appears to be good for individual image usage, but is 
problematic for group viewing situations such as classrooms. Impediments to 
widespread adoption include: lack of comprehensive content, absence of 
necessary tools to facilitate use, and inadequate recognition and support for 
faculty who adopt new technology in their teaching. Other key issues that still 
need to be addressed include: integration of consortia-provided images and 
metadata with images acquired elsewhere; allowing instructors to change 
descriptive information or annotate images; encouraging the creation of added- 
value tools; and providing particular user interfaces or new integrated tools. 
The study also compared the cost of digital distribution to the costs of running 
an analog slide library. 



Introduction 

A number of communities are interested in the viability of digital libraries. The 
University of California, Berkeley Mellon Grant study, Cost of Digital Image 
Distribution: The Social and Economic Implications of the Production, Distribution, 
and Usage of Image Data, [5] is an important step towards understanding the issues 
likely to affect the economic viability of one specific type of digital library: a
collection of photograph-like digital images of cultural heritage objects and their 
accompanying descriptive metadata, as distributed to university communities. 

With generous support from the Andrew W. Mellon Foundation, researchers from 
UC Berkeley undertook a 22-month study (September 1996 through June 1998) to 
examine the costs of digital image distribution for educational purposes by looking at 
an experimental project, the Museum Educational Site Licensing Project (MESL). 
The focus of this study was to identify, define, and explore the primary cost centers in 
the digital network distribution of images and text through the MESL Project. 

While a number of previous projects¹ had attempted to promote standards and
interoperability in either the library or the museum worlds, the MESL Project was the 
first attempt to take a collection of images and accompanying metadata from a variety 
of museums and deliver these in digital form to university users over campus 
networks. It was a two-year experimental collaboration among seven museums and 
seven universities that distributed over 9,000 digital images and associated text for 
classroom use. 

In the UC Berkeley Study we identified major cost centers associated with the 
entire process of making the images available, from the museums’ creation of the 
images to the universities’ deployment. Through "end-of-project" technical reports,
we collected data on the time spent completing tasks associated with the cost centers 
(including, for the museums: content selection, image preparation, data preparation, 
and data transmission). 

We initially planned to focus all of our attention on the MESL project. As the study 
progressed, however, it became clear that a study of MESL alone was inadequate for
understanding issues of digital image distribution and long-range viability. Because 
there had been no prior studies of costs of departmental slide libraries, we embarked 
on a set of slide library sub-studies that could provide us with comparative "baseline" 
costs in the analog world.² And because any digital image distribution scheme would
fail unless embraced by university instructors, we also created a study designed to 
explore the factors influencing faculty willingness to "buy in" to digital distribution 
and use digital images in teaching. 



¹ Most notably, the Getty Information Institute’s various standards projects (Categories for the
Description of Works of Art, Art & Architecture Thesaurus, Thesaurus of Geographic 
Names, Union List of Artist Names, etc.) and the Consortium for the Interchange of Museum 
Information (CIMI) Project. 

² While there had been no significant baseline economic studies of departmental slide libraries,
there are a number of important studies of the costs of running central library services for a 
campus. These central library studies may prove valuable in the future as part of the baseline 
comparisons for centralized digital image library collections. 
The resulting study compares costs of the MESL distribution method to previous
analog methods of image distribution via 35mm slide library collections. The study’s 
July 1998 Final Report [5] discusses the advantages and disadvantages of digital
image distribution, and identifies impediments to user acceptance of digital image 
distribution. 

From our study we have learned that, in its technical and operational details, the 
MESL distribution was a very special case that will not be repeated. However, the 
goal of museum/university digital image distribution is important to most of the 
parties involved, and we believe that there will be future attempts at digital image 
distribution that will follow other models. Our study has uncovered many features of 
analog slide libraries and how they are run. Some of this information will be 
important to architects of future distribution systems. We have also uncovered ways 
in which analog slide libraries are quite different from digital image distribution 
systems, and we expect that the two will exist side-by-side for many years to come. 

Comparisons between emerging digital image distribution systems and any 
existing entity are limited, at best. Though their content resembles that of analog slide
libraries, their funding schemes and organizational structures and settings will not.
Digital image distribution schemes serving universities will require different types of 
institutional roles and responsibilities. As Clifford Lynch has said, their success will 
depend upon a complicated set of issues involving institutional readiness, 
commitment of campus-wide acquisition budgets, centralized support of 
infrastructure, and a host of other issues common to the introduction of digital 
collections. [7] This makes predictions about their future difficult at best. 

This paper provides a summary of our study. Here we discuss the advantages and 
limitations of extrapolating from a study of MESL, general findings from the MESL 
Project as a whole, important issues emerging from our focus group interviews with 
faculty, discoveries from the study of analog slide libraries, and important findings 
from our cost center analysis. We also make a number of general observations about 
what we regard as important issues in understanding the visual resources 
environment. 



Advantages and Limitations of Extrapolating from a Study of 
MESL 

The promise of digital distribution is increased accessibility and the potential for 
enabling new uses. MESL was designed as an experiment to test whether increased 
access to images through digital distribution was feasible. The initial MESL proposal 
sought to use "digital imaging and network technologies" to "make cultural heritage 
information more broadly available." It cited two basic objectives: (1) to develop, test,
and evaluate procedures and mechanisms for the collection and dissemination of
museum images and information, and (2) to propose a framework for a broadly based
system for the distribution of museum images and information on an ongoing basis to
the academic community.

Although the goal of this study was not specifically to assess whether MESL met 
its stated goals and objectives, it is very clear that as a feasibility study MESL was a 
success — it demonstrated that a large number of cultural heritage images and 
accompanying metadata could be distributed over campus digital networks. On the 
other hand, the success of MESL as a demonstration of an ongoing, accessible, and
multi-institutional database of digital art images is unclear. The MESL Project did not
shed light on how institutions will organize the acquisition and management of image 
collections on an ongoing basis, and it highlighted serious consistency and 
accessibility issues. These are part of a fundamental set of differences between 
creating a true integrated digital library and simply merging records from different 
repositories. [2] [4] 

The MESL project appeared to provide an excellent opportunity for a study, 
because it was the first attempt at large-scale distribution of digital images to the 
educational community. The prospect of studying seven different university
distribution implementations involving images and text coming from seven different
collections seemed to offer a chance to compare different approaches to solving a
similar problem.

But several of the factors that made MESL a rich environment to study also made 
it extremely difficult to obtain consistent and useable data; some of the heterogeneous 
environments were so different that it was not even possible to find common units of 
measurement to make comparisons. Our study, therefore, chose to focus on cost 
centers that were likely to be present as steps in the distribution chain (from the 
selection of the original object to the point where its digital representation reaches the 
end user) in most potential distribution models. For each cost center at each step, our 
study estimated costs, primarily in terms of hours committed to accomplishing the 
tasks. (Because dollar costs from MESL's heterogeneous institutions are difficult to 
identify with any degree of certainty, and are likely to fluctuate radically among 
institutions and over time, we avoided citing them.) And because MESL was a short- 
term project, its costs were not reflective of the "steady-state" costs which would 
accompany an ongoing long-term project. 
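
The hour-based accounting described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the instrument used in the study: the institution labels and hour figures are invented placeholders, and the cost center names are simply the museum-side tasks listed in the Introduction.

```python
from collections import defaultdict

# Hypothetical task reports: (institution, cost center, hours spent).
# Every name and number here is a placeholder, not data from the study.
REPORTS = [
    ("Museum A", "content selection", 40.0),
    ("Museum A", "image preparation", 120.0),
    ("Museum B", "content selection", 25.0),
    ("Museum B", "data preparation", 60.0),
    ("Museum B", "data transmission", 5.0),
]


def hours_by_cost_center(reports):
    """Sum reported hours per cost center across all institutions."""
    totals = defaultdict(float)
    for _institution, center, hours in reports:
        totals[center] += hours
    return dict(totals)


totals = hours_by_cost_center(REPORTS)
```

Recording hours rather than dollars keeps the totals comparable across institutions whose salary scales and overhead differ, which is the reason the study gives for avoiding dollar figures.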

A key problem with trying to extrapolate from the MESL project is that, of 
necessity, MESL planners chose only one of many possible distribution/delivery 
models. In this case, each museum prepared images and text according to rough 
specifications. This data was then sent to a central site in Michigan and checked for 
delimiters there. Next, the universities engaged in similar procedures of merging the
data from the seven museums, importing the text into their own database structures, 
indexing the text in different ways, converting images into the sizes they wanted to 
support (including thumbnails), etc. The MESL distribution model was designed for a 
demonstration project; it is highly unlikely that this model would be used in 
production mode. Follow-on projects would likely involve either direct delivery to 
end users from a single central site that also takes responsibility for data integrity 
(instead of the MESL model which involves local mounting at each campus and 
barely addresses data integrity issues), or complete local mounting of data obtained 
from a variety of sites (the analog slide collection model). This is yet another reason 
why we chose to focus on relatively broad cost centers that may shift between a 
museum, a central facility, or a university (depending on the distribution/delivery 
model). And even though the costs in any given center may increase or decrease (even 
to near zero) depending upon the model, the cost centers are broad and critical enough 
that just about every one of them is still likely to exist somewhere within any model 
in the foreseeable future. 
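
As a rough illustration of the central-site delimiter check and the per-campus merge described above, the following Python sketch validates delimited text records and combines deliveries from several museums. The delimiter character, the field names, and the sample records are all assumptions made for the example; the text does not specify the actual MESL exchange format.

```python
DELIMITER = "|"  # assumed; the text does not name the delimiter used
FIELDS = ["object_id", "title", "creator", "image_file"]  # hypothetical fields


def check_delimiters(line):
    """Return True if the line has exactly one delimiter between each field."""
    return line.count(DELIMITER) == len(FIELDS) - 1


def merge_deliveries(deliveries):
    """Merge per-museum record lines into one list, setting aside bad lines."""
    merged, rejected = [], []
    for museum, lines in deliveries.items():
        for line in lines:
            if check_delimiters(line):
                record = dict(zip(FIELDS, line.split(DELIMITER)))
                record["source"] = museum  # keep provenance for later checks
                merged.append(record)
            else:
                rejected.append((museum, line))
    return merged, rejected


# Placeholder deliveries; the second Museum A line is missing a field.
deliveries = {
    "Museum A": ["A-1|Vase|Unknown|a1.jpg", "A-2|Portrait|a2.jpg"],
    "Museum B": ["B-7|Landscape|J. Doe|b7.jpg"],
}
merged, rejected = merge_deliveries(deliveries)
```

A check of this kind catches only structural problems. It says nothing about whether the values inside a field are used consistently across museums, which is the data integrity gap noted above.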

Extrapolations from a study of MESL must be limited for a variety of reasons. The 
procedural organization of the MESL distribution model is not likely ever to be 
repeated. The project was begun ahead of its time on the technology curve, and many 
of the technologies and procedures attempted were then experimental or unknown. It 
was a voluntary project with few incentives for the participants to act or respond as 
they would in an ongoing distribution arrangement. And gathering data for analysis 
was problematic because of differing units of measurement. 

We view our primary contributions as identifying the major cost centers and 
providing methods for examining each of these cost centers by estimating their 
associated time commitments, while acknowledging contextual variability such as 
learning curves. The study also identifies the human knowledge, background, and 
organizational infrastructures needed to create, distribute, and mount the digitized 
materials in order to complete the MESL project. By identifying and isolating these 
cost centers, as well as the knowledge and resources needed to accomplish each phase 
of digital distribution, we believe that we can provide an important framework for 
future projects, even those using models where any given cost center may move into a 
different type of organization. Most importantly, we have also identified a variety of 
impediments that must be overcome before digital image distribution schemes are 
widely adopted. 

Because this study examined relatively virgin territory, it was necessary to explore 
the uncharted surrounding areas of image delivery in order to make sense of what 
digital distribution modalities mean. To effectively understand MESL’s digital 
distribution process and costs, we needed to understand the analog distribution model 
that has been provided by 35mm slide libraries. By compiling cost and distribution
data from six slide libraries at five university campuses, we performed the first 
extensive cross-institutional study ever done of slide library costs. These analog 
distribution costs form an important foundation for examining digital distribution 
costs. However, because the role and functionalities of analog slide libraries differ 
from those of digital distribution schemes like MESL, and because sustainable digital 
distribution systems will differ markedly from experiments like MESL, we would 
strongly caution against direct comparisons. 

The MESL Project has shown us that merely mastering the complex technical 
process of delivering images and text to the desktop does not by itself make such a 
system viable. A number of accessibility issues are hidden within that process, such as 
how to get a set of repositories to adopt standards (and how to enforce 
standardization), how to ensure consistent use of data values between repositories, 
how to map terminology into the vernacular for users, how to determine what kind of 
user interfaces and searching capabilities are needed, etc. In addition, many critical 
issues exist outside of that technical delivery process, including how to provide 
enough of the images that users actually need, and how to ensure that instructors who 
invest in curriculum development will continue to have the rights to ongoing access to 
the images they develop their courses around. 

From MESL we have learned that a digital distribution system is a very complex 
and interlinked process. Its viability depends on supply, accessibility, and demand; 
images need to be available, easily accessed, and, perhaps most importantly, wanted 
and needed by the intended users. The various parts of the system are dependent upon 
decisions made in the other parts (e.g. usage depends upon image and metadata 
quality, critical mass, delivery and infrastructure, etc.). Examining any single function 
in isolation (such as image storage and digital database access) would lead to 
misunderstandings and misrepresentation of the system. 



General Findings from the MESL Project 

MESL demonstrated that while some factors encourage the use of multi-institutional 
digital image databases of cultural heritage objects, there are also significant barriers 
to their widespread use. The following are critical observations and suggestions in the 
areas of viewing, content, searching and access, technology, infrastructure, and 
policy. 

1. The digital distribution environment, as a whole, appears to be good for 
individual usage, and provides access from multiple locations. Most users’ home 
environments are currently inadequate for comfortable use of digital images, but
that should change with increased bandwidth, processing power, and screen size. 
We are even beginning to see wired dormitories as part of campus networks. 
Shifts to off-site use may alleviate the need for more on-campus computer labs, 
but will require more sophisticated user authentication systems. However, groups 
that are not central to the university mission (e.g. alumni, visiting faculty, other 
visitors) which currently enjoy walk-in access to analog resources may lose 
access to these additional resources altogether, as authentication systems and 
licensing arrangements for digital materials become able to distinguish more 
finely between user groups. 

2. Digital image distribution in its existing form is problematic for group viewing 
situations, such as in the classroom, where analog delivery is simple, fast, cheap, 
dependable, and requires little technological infrastructure. Electronic 
classrooms, computing and network infrastructure, technical and instructional 
support, and image quality issues need to be addressed before digital distribution 
to the classroom becomes viable. 

3. The lack of comprehensive content made the database extremely problematic for 
coursework purposes. For a digital image distribution scheme to be successful, a 
repository must be able to provide a critical core of important images, what 
Clifford Lynch has called a "reference collection." [7] Most significantly, the
definition of "critical core" is likely to be dynamic. New approaches to 
disciplinary understanding are constantly changing what is considered to be 
central material for pedagogical purposes (for example "popular art" and "art and 
gender"). For most users, even a critical core will not offer a comprehensive 
corpus. Many faculty teaching with MESL images voiced a need for a "critical
mass" of images that would approach the corpus size of their analog slide 
libraries. 

4. Because faculty content needs can be robust and shifting, a digital image 
distribution scheme will almost certainly also need to give faculty the option of 
integrating locally produced material. (Many MESL universities reported having 
to supplement the MESL database with custom images drawn from their slide 
libraries.) Future systems must be both extensible and easy to supplement. 

5. MESL content attempted to be responsive to faculty needs. A content selection 
process solicited faculty input. This kind of active connection between content 
selection and instructional programs is not likely to scale up. Collection 
development for future systems will probably be museum-centric, with museums 
defining a core reference collection that they will distribute. 

6. The different metadata vocabulary and general language used by different 
institutions made the creation of an integrated and consistent database 
problematic at best. It is glaringly evident that a project like this needs guidelines
and standards at many levels (from field delimiters to controlled vocabulary),³
and that the standards developed within MESL were not, by themselves, enough. 
And it is likely that this type of problem will increase as the corpus or domain of 
coverage scales up. The MESL data dictionary managed to map actual field 
names into a common exchange format, but the project neither addressed what 
those field names meant to the body of end users, nor addressed the differing 
ways in which the contributing repositories used vocabulary within a given field. 
And since most object metadata was taken from collection management system 
records, most vocabulary was in the language used by museum curators and 
registrars. Digital distribution schemes like this could be much more effective if 
we better understood vocabulary issues in general: how to translate the 
specialized vocabulary used by specialists into the vernacular used by general 
users, and how to better map between the various knowledge organization 
frameworks of different domains. 

7. The interface and the ability to query and manipulate the database are critical for
future use. Additional tools for examining, organizing, and saving retrieved sets 
are also necessary. The MESL model of localized control over distribution 
discouraged development of expensive retrieval systems. A more centralized 
model would be able to spread the development costs over a wide body of sites, 
and would likely lead to better retrieval tools. But local customization of such a 
system may still be desirable, and this poses an interesting research issue in 
system design. 

8. At the time of the MESL project, storage space was becoming less of an 
impediment, but network speed and bottlenecks at routers and servers may 
remain an issue for image and multimedia delivery, particularly if users are 
accessing remotely mounted collections. 

9. Universities appear capable of constructing systems that can provide some 
security and protection for intellectual property. Recently developed university 
user authentication systems seem to be good enough to meet today’s museum 
requirements. But in the long run museums may want control over image use 
(such as copying and reposting) rather than the control only over initial image 
access that today’s security systems provide. 

10. Humanities departments tend to be underfunded and technologically
inexperienced compared to engineering and science departments, even at
technologically advanced universities. Though the MESL Project did hasten the
arrival of wiring to some humanities buildings, workstations to humanities
faculty desks, computer projection to humanities classrooms, and computing
support to humanities departments, these departments may continue to lag behind
in speed, processing power, and training needed to use new systems.

³ We do not mean to imply the necessity of adherence to one single standard in all cases at
every level. With controlled vocabulary, for example, a solution might emerge employing
"crosswalks" between a limited set of controlled vocabulary lists, with each institution
adhering to one of those lists (rather than all institutions conforming to a single one).

11. There is much university enthusiasm for the use of digital surrogates for cultural 
heritage material, but many problems must still be addressed before there is 
widespread end-user acceptance. Instructors are particularly concerned about lack 
of departmental recognition for what, in their experience, has been a vastly 
increased workload from teaching with digital images. Tools need to be 
developed to make digital images easier to use and particularly to make it easier 
to use them to build curriculum material. But who in the institution will have the 
responsibility, funding, and expertise to develop these tools is a serious question. 
Will this be the responsibility of the central library, the departmental library, 
central computer services, or the individual instructor? 

12. The MESL Project has shown that universities and museums have common 
interests in providing images and metadata to users. Though conflicts arose 
periodically, in general the project proved that they have more common than 
divergent interests and can work together well. But new areas of conflict may 
arise (such as when faculty present enhanced content back to the museum, or 
when faculty want to distribute new products they create using museum images). 

13. Audiences for a museum’s digital information also exist outside the university 
community. All involved hope that museums can leverage their efforts at digital 
distribution to universities to help them deliver to additional audiences. But 
museums need to take into consideration the special needs of those additional 
audiences, paying particular attention to the need for different descriptive 
vocabulary and for organizing sets of images contextually. For instance, the large 
K-12 community probably needs thematic arrangements of images complete with 
descriptive information in vocabularies much different than those of curators or 
art historians. Museum consortia should consider encouraging teachers and others 
to create added-value packages that can then be redistributed to others.

14. Copyright issues are significant. Museums tend to be cautious about distributing 
digital images of works unless they are absolutely certain about rights clearance 
on the original work (though through the MESL project some museums became 
less strict about this). Because current copyright law leaves reproduction rights 
for original works with the artist’s estate for some period after the artist’s death, 
and because most museums have not explicitly obtained digital reproduction 
rights when acquiring a work for their collection, very few 20th-century works 
will be distributed in digital form by museums for some time to come. Proposed 
legislation on databases and on works in digital form may greatly affect projects
involving the distribution of digital image collections. 

15. The value that museums can provide to universities in projects like MESL may 
lie more in the authoritative metadata than in the digital images themselves, and it 
is a mistake to view these as merely imaging projects. The expertise of the 
museum in the form of authoritative metadata describing an object and its context 
is critical for scholarly research, and appears to be something the museum can 
copyright, own, and sell. This situation stands in contrast to the lack of agreement 
in the legal community on whether digital images of art works in the public 
domain are copyrightable. 



Comparisons with Analog Slide Libraries 

The study of analog slide libraries has shed some light on certain functions that exist 
in the analog versus the digital distribution environment, and has also demonstrated 
how certain cost centers may differ between these environments. Our study of the 
analog environment was not extensive enough to answer all the important questions, 
but it answered some and suggested further comparative studies that should be 
undertaken: 

1. Analog slide libraries provide a valuable set of services, some of which would be 
lost in currently emerging models for digital distribution. Slide libraries are 
customized for their local environments and metadata is customized to meet local 
needs. Acquisition is end-user driven, and responds quickly to local user 
demands. A research agenda for digital distribution schemes should consider how 
future models might support these types of services. 

2. Not all images needed by university users are of the sort held by museums. 
Therefore the university’s image needs extend beyond what can be met through 
museum consortia. Cultural heritage slide libraries often include images from 
architecture, religious structures (churches), popular culture, private collections, 
public site-specific art (cemetery art, monuments, fountains), lesser-known and 
local artists, and community-based art (such as murals). Collections also 
frequently include other types of images to provide context for a time period, 
place, style, or theme. 

3. Analog slide libraries are primarily based in individual campus departments. 
Digital distribution schemes, however, are likely to be housed in or contracted by 
campus-wide units. Therefore funding schemes and institutional roles and 
responsibilities will be much different than the departmental models that 
characterize slide libraries. This makes comparisons and predictions very 
difficult. 

4. As we discover some of the types of functionality that users of analog slide 
libraries find useful and perhaps necessary (such as slide-sorting functions), we 
can outline some of the functions that digital image libraries are likely to need in 
order to attract and retain users. Further study of user behavior in selecting and 
arranging slides for classroom presentation will be helpful in determining 
functional requirements for desktop toolsets. 

5. Circulation statistics from analog slide libraries can give us benchmarks against 
which to compare likely overall use of digital image libraries, and indicate likely 
periods of heavy use. We know that digital delivery removes time and location 
constraints that limit analog slide use, and we expect that use will increase once 
digital delivery systems are adopted by users. We also know that use will 
increase as these collections begin to serve users outside of core departments. As 
long as digital image collections still strain systems resources, this analog use 
data can help system architects and planners by suggesting times and levels of 
high activity. 

6. Our study has revealed a small but significant group of analog slide users that 
come from outside the primary slide library community. We can expect that these 
numbers will increase in a digital world where gaining access does not involve 
visiting an analog slide library located in a particular academic department. 

7. We know that some analog slide library costs (such as re-filing) are just not 
applicable to a digital environment. If we know how significant those costs are, 
we can begin to discuss likely cost savings in a digital environment. Our analog 
study suggests that while the amount of time involved in re-filing is significant, 
the actual cost of this effort is not great due to the use of low-paid personnel. But 
we have no idea of the impact of misfiled or lost slides on scholarship. 



Cost-Related Findings 

Our study of MESL identified broad cost centers for the image providers in the 
preparation process (content selection, image preparation, text data preparation, image 
transmission, and text data transmission) as well as for the image distributors in the 
delivery and deployment processes (preparing images, preparing structured text data, 
preparing unstructured data, creating functionality tools, providing security/access 
control, outreach, usage training, and technical development).* This study of MESL 
cost centers and related studies of analog slide libraries and faculty attitudes shed light 
on a number of interesting issues. Administrators and others involved in planning 
digital distribution should be particularly interested in the following observations: 

1. University administrators are very concerned about controlling content costs, and 
faculty are concerned about ensuring access to images around which they can 
build curricula. These positions put them in conflict with museum image 
distribution consortia that want an ongoing stream of revenue and are 
understandably reluctant to guarantee ongoing access without payment. Positions 
must change for consortia efforts to be successful. (For example, museums may 
decide to heavily subsidize consortia efforts with revenue from traditional 
sources such as licensing images to publicity agencies. Or university 
administrators may decide that they can perpetually commit funds for licensing 
images from museum consortia.) Recent initiatives such as the Association of 
Research Libraries’ Scholarly Publishing & Academic Resources Coalition[8] 
may offer interesting models to follow. 

2. It is critical to note that museums and universities both own large collections of 
images, and that both make extensive use of images as part of meeting their 
mission. These two types of institutions have more in common than not, and 
attempts to improve the position of image holders and image users vis-a-vis one 
another are likely to hurt both types of institutions. Emerging models for digital 
image distribution should not treat these two types of institutions as adversaries. 

3. Though the MESL project did not address the topic, we know very little about 
what is required to maintain any kind of digital repository over long periods of 
time.[6][3] We need more studies of costs and issues involved in maintaining 
general bodies of digital information, as well as studies specific to maintaining large 
groups of digital images and their associated metadata. Such studies must take 
into consideration both technical issues for maintaining the information in a 
useable format, and support and management issues in providing access to that 
information. And while we still know very little about ongoing upkeep and 
maintenance costs for a digital repository, our study did shed some light on how 
certain ongoing costs decline over time while others rise (see point #8 below). 

4. It will be a long time before digital image repositories will be able to deliver to 
users the critical mass of images needed for instruction and research. (Though 
critical mass size is difficult to estimate and will vary among potential user 
communities, for most user groups we expect that critical mass will have to 



* In future distribution schemes some of the MESL distribution functions may well be handled 
by local entities. 
exceed the holdings of a moderate-size analog slide library: 250,000 images.) It is 
clear that analog slide libraries and digital image repositories will coexist for 
many years into the future. Planners must assume a hybrid form of image 
distribution, and may be hard-pressed to determine how best to allocate resources 
between the two. In cases where analog slide libraries exist within the same 
organizational structures as digital collections (e.g. libraries), there may be a 
strong impulse to redirect funding from the former to the latter, with serious 
consequences for the overall mix of resource availability. But if, instead, central 
library acquisition budgets are used to underwrite image collections, this may 
affect other areas of collections development. Libraries, slide collections, and 
administrators would be well served by joining together to articulate an overall 
strategy for image provision in this transitional phase which acknowledges the 
competition for scarce resources. 

5. While analog slide libraries have been managed primarily by individual 
departments, digital collections are likely to be managed by campus-wide units 
(such as libraries, computer centers, and instructional technology units). This is 
likely to force changes in resource allocation, as well as in subject-matter 
specialization and support services. 

6. Museum consortia planning digital distribution expect to deliver images and 
metadata directly to the user’s desktop, rather than having universities act as 
redistributors as in the MESL Project. While there are compelling reasons to 
follow this model (better control, lack of duplication of the tremendous effort of 
local mounting, etc.), consortia implementers would be wise to consider the 
provision of some local mounting and control functions. University users are 
likely to expect the features and capabilities they have with analog slides to be 
available with digital images as well. Key issues for the universities include how 
to integrate consortia-provided images and metadata with images acquired 
elsewhere; how to allow instructors to change descriptive information or annotate 
images; how to encourage the creation of added-value tools; and how to provide 
particular user interfaces or new integrated tools (such as slide sorting, saved sets, 
image overlays, or image comparisons) to a group of campus users. In many 
ways this issue of local versus central mounting is similar to the issue of whether 
university libraries should mount copies of scholarly journals or arrange for their 
users to get these directly from the publisher’s website. But a key difference is 
that contemporary scholarly practice for the cultural heritage community requires 
many images outside a central corpus, and instruction in this community 
frequently requires supplemental descriptive information. 

7. The infrastructure cost of providing and maintaining a digital distribution system 
is significant, and well beyond the current budgets of analog slide libraries.^ 
Digital distribution of cultural heritage images can only be viable in the near 
future if infrastructure costs (such as high-speed networks, classroom projection 
and workstation labs, and workstations themselves) are spread across academic 
units or are budgeted for at levels higher than the departmental, for instance as a 
part of a university-wide technology initiative. 

8. Many of the costs during the MESL project were start-up and learning costs that 
would not be incurred on an ongoing basis. But the data suggests that some types 
of costs (such as support) will only decline slightly in subsequent years. Cost 
centers that would normally tend to decrease with greater experience may in fact 
increase as the underlying technology periodically changes and brings new costs 
to bear. (For example, changes in system architecture, such as moving to a new 
better/faster underlying database, may cause an increase in text and image 
preparation costs. We saw significant cost increases when MESL sites changed to 
Web-based delivery.) And certain costs will increase because the size and scope 
of the project increases (for example, delivery to a larger user population, such as 
alumni or other non-students, may necessitate a much more sophisticated — and 
expensive — approach to security). 

9. A cursory look at analog distribution costs may be deceptive; accurate costs 
should be balanced against potential use. As studies of electronic versus print 
journals have speculated, costs for electronic resources need to be weighed 
against a very different access parameter than that used for analog resources. 
Although analog systems may be cheaper to maintain, there are many more 
potential users for a digital system than for an analog one. 

10. Digital image distribution models can provide access to materials that have had 
only limited accessibility in the past. From the MESL data, however, it is still not clear 
whether digital access is likely to be cost effective anytime soon. We do agree 
with Bates that, in the long run, "for the same dollar expenditure (as in pre-technological 
environments) learning effectiveness can be increased, or more 
students can be taught to the same level of investments."[1] In other words, we 
are skeptical about costs significantly diminishing (although we believe that they 
may move to other points along the chain), but we are optimistic about 
technology leading to more widespread learning. 



^ For example, the entire $27,000 annual operational budget of one of the slide libraries in our 
study couldn’t possibly cover the likely infrastructure costs of creating a digital distribution 
system (where just functionality in the first year cost $24,000). 

Conclusion 

We believe that the MESL Project was one of the first steps in a transition towards 
digital image libraries, and that digital collections may eventually replace analog slide 
libraries. Our study of MESL has revealed some of the differences between slide 
libraries and digital distribution schemes, and has identified some of the problems that 
must be resolved before digital image distribution is widely accepted. This study has 
uncovered important information for designers of digital image distribution schemes. 
We have highlighted issues of cost, content, infrastructure, and user acceptance. We 
have shown the serious access issues that emerge from combining text records from 
museums that use different forms of vocabulary control, and have demonstrated that 
different distribution approaches towards indexing can yield vastly different search 
results. We have noted how analog slide libraries differ from any digital image 
distribution scheme proposed thus far. And we have expressed concerns about where 
digital image distribution schemes might fit within an institutional hierarchy. 

We believe that, in the long run, it will be difficult to financially justify repetitive 
isolated collections of images on different university campuses. Yet, the tailoring of 
local collections to local needs (provided by analog slide libraries) is critical to the 
current instructional environment. We think that it is important that analog slide 
libraries and digital image distribution consortia coexist for many years to come. But 
we are very concerned that university administrators will be unwilling or unable to 
support the financial burden of such hybrid systems. 

We feel that this is a propitious time for this study, as two museum consortia are 
currently developing plans to distribute digital images to the museum community. 
Both the Art Museum Directors Association’s AMICO project and the American 
Association of Museums’ Museum Digital Licensing Consortium are currently 
designing their distribution and delivery schemes. Their business plans could benefit 
from a better understanding of the cost centers and efforts involved in making digital 
images and accompanying text available to the educational community. Thus far these 
consortia have focused their attention on framing issues (such as terms and conditions 
of use). Their work on models for production, processing, and deployment, as well as 
development of cost models, can be greatly informed by the results of the Mellon- 
sponsored study. 

Abstract. The rapid expansion of multimedia digital collections brings 
to the fore the need for classifying not only text documents but their embedded 
non-textual parts as well. We propose a model for basing classification 
of multimedia on broad, non-topical features, and show how information 
on targeted nearby pieces of text can be used to effectively classify 
photographs on a first such feature, distinguishing between indoor and 
outdoor images. We examine several variations to a TF*IDF-based ap- 
proach for this task, empirically analyze their effects, and evaluate our 
system on a large collection of images from current news newsgroups. In 
addition, we investigate alternative classification and evaluation meth- 
ods, and the effect that a secondary feature can have on indoor/outdoor 
classification. We obtain a classification accuracy of 82%, a number that 
clearly outperforms baseline estimates and competing image-based ap- 
proaches and nears the accuracy of humans who perform the same task 
with access to comparable information. 



1 Introduction 

As digital collections on the World Wide Web, corporate intranets, and CD-ROMs 
increase vastly in size and availability, it is becoming increasingly important 
to find efficient methods of categorizing not only text documents but 
also images, video, sound files, and other multimedia embedded within a docu- 
ment. Work in information retrieval has focused primarily on text, and then on 
classifying an entire document as relevant to a particular query or as a mem- 
ber of a specific class. Yet, much is to be gained by independently categorizing 
and indexing pieces of a document from different media; multimedia information 
arguably follows a different classification hierarchy than text, and more factors 
than topical relevance come into play when an image or other non-text data is 
included within a document. For example, a news article on the recent events in 
Kosovo may include a picture of an airplane at a U.S. base, even though that 
particular aircraft never participated in the operations described in the article. 
The same image can frequently be found in multiple related documents, and, 
conversely, an independent classifier of images could help select an image from a 
broad, separate collection for illustrating a summary of a text-only source. Un- 
desirable images (e.g., advertisements) could also be detected and pruned before 
a document is displayed to the user. 

In the present work, we explore such an independent classification for images, 
using information from associated text sources such as captions and the 
surrounding text in the document in which the image is embedded. We are 
informed and motivated in this endeavor by the parallel development of a mul- 
timedia, multiple document summarizer (Aho et al. 1998), where appropriate 
images can enhance the text summary. Our approach centers on the develop- 
ment of a suitable class hierarchy of broadly applicable visual features that will 
facilitate the selection of appropriate images for such summaries, even when fine 
distinctions (such as the subject matter of the image) are not available. Such 
features include classifying the images as indoor or outdoor; as containing one 
or a few persons or a crowd or no people at all; and as depicting a natural land- 
scape versus a city scene. If independent classifiers can be designed for these 
features, then we can infer the appropriateness of the image for a particular de- 
scriptive purpose with high likelihood given only a little domain guidance. For 
example, an outdoor image with no people from the terrorism news domain is 
likely to show the scene of an event or its aftermath, while an indoor photograph 
with a crowd of people probably refers to a related press conference. Additional 
techniques can refine these inferences, by using for example information extrac- 
tion methods (Wacholder et al. 1997) to identify the location of an event or the 
names of specific participants in the images. 

We report in this paper on our methods and results for classifying images as 
indoor versus outdoor. We chose this visual feature as a basis for a first division 
of the images because of its plausibility as an indicator of image content and 
because it is used as a high-level feature in image ontologies for image and digital 
signal processing (Vailaya et al. 1999). It is also a feature for which purely visual 
classifiers can be built (Szummer and Picard 1998); in fact, we are developing 
such classifiers in parallel with the text-based ones described here, and we plan 
to investigate ways to integrate them in the future. Although we have focused on 
this category, the methods described in the paper are independent of the specific 
feature and can be applied to any of the broad categories identified earlier.^ 

Our indoor/outdoor classifier for images is based on information retrieval 
measures of text similarity, such as term frequency and inverse document fre- 
quency (TF*IDF) (Salton and Buckley 1988; Salton 1989). Unlike information 
retrieval, however, we have to work with small pieces of text (a caption or a por- 
tion of a caption). Hence, we examined and evaluated several potential improve- 
ments to standard IR techniques, such as using targeted parts of the available 
text, limiting ourselves to particular word classes, and partially disambiguating 
words according to their part of speech. We collected a large sample of 1,675 
images for training and evaluation, and had multiple human volunteers assign 
indoor/outdoor labels to them. We measured individual human performance on 
this task against the standard implied by their agreement, and compared our sys- 
tem’s performance to the humans, a default baseline classifier, and image-based 
classifiers that operate on purely visual features (e.g., color, texture, and edge 

^ The main cost for moving on to new categories involves the necessary manual labeling 
of a large set of images for training and evaluation. 
direction features). We optimized our classifier using three-fold cross-validation, 
varying several of the TF*IDF parameters and optional features and determin- 
ing which of the features have a major effect on performance. Using probability 
density estimates for the output of the classifier, we are able to correct several 
potential misclassification errors. Our results show that the automatic system 
clearly outperforms the baseline and image-based classifiers, approaching the 
accuracy of the human volunteers. We extend these results by considering alter- 
native evaluation methods and the effects of lenient versus strict definitions of 
the indoor and outdoor categories. We also explore another method for identify- 
ing words that discriminate between the two categories and measure the effect of 
additional high-level features (in this case, the number of people in each image) 
for indoor/outdoor classification. 

2 Related Work 

Our classification approach draws on a long line of work for measuring text 
similarity, mostly in an information retrieval context. Most of the informa- 
tion retrieval approaches rely on single words (e.g., (Salton and Buckley 1988; 
Salton 1989)), although sometimes compounds and collocations have been used 
(Smeaton 1992). Some of the features we explore (e.g., ignoring capitalization) 
are also used by default in most IR systems. Other, more natural language- 
informed features have found mixed success in information retrieval (e.g., 
(Salton and Smith 1989; Gay and Croft 1990; Smeaton 1992)), although the use- 
fulness of each feature needs to be evaluated separately for each application 
(classifying image captions is different than classifying entire documents). 

For topical image classification, keywords extracted from a document have 
been used to index an associated image (Bach et al. 1996; Smith and Chang 1997), 
and image similarity has been measured on the basis of shared image features 
(Niblack et al. 1993; Pentland et al. 1994) and by a combination of textual and 
image feature matches (Ogle and Stonebraker 1995; Smith and Chang 1997). 
Rowe and Guglielmo (1993) and Smeaton and Quigley (1996) use information 
from captions for retrieving (rather than classifying) images given a query. Sri- 
hari (1995) uses face detection techniques along with name extraction from cap- 
tions for linking images to specific people. Classification of images along broad, 
non-topical features such as those we are exploring has received less attention in 
the image processing literature, although this is beginning to change. Forsyth and 
Fleck (1996) present an image-based detector for naked people, while Szummer 
and Picard (1998) describe an approach for separating consumer photographs 
into indoor and outdoor classes. Both of these approaches utilize as their input 
only low-level visual features, such as color and edge direction. 

3 Data Set 

Our raw data set consists of 21,086 news postings from April 1997 to May 1998 
from a variety of Usenet current news newsgroups. Of these, 1,490 contain, in 
addition to a text article, one or more embedded images, each with an associated 
caption. Captions are generally two to four sentences long. The first sentence 
in the caption tends to describe the image, while the remainder usually gives 
background information and establishes the relevance of the image to the story. 
For example, 

BANGKOK, THAILAND, 9-NOV-1997: New Thai Prime Minister Chuan 
Leekpai gives a traditional “wai” to thank members of his party applauding 
his entrance, November 9, during a ceremony appointing him as the country’s 
23rd prime minister in Bangkok, Thailand. Chuan was named prime minister 
for the second time, replacing Chavalit Yongchaiyudh at the helm of a country 
plagued by economic woes. 

For training and testing, a web-based interface was set up allowing volun- 
teers to label images according to two high-level features. The first feature cor- 
responded to the indoor versus outdoor dichotomy, and the choices given were 
Indoor, Outdoor, Likely Indoor, Likely Outdoor, and Ambiguous. The second 
feature was number of people, and the available choices were No People, One 
Person, Two People, Three or More People, Crowd, and Ambiguous. In both 
cases, the authors went over a sample of images in advance, identified potential 
problems, and supplied the evaluators with detailed instructions which can be 
viewed at http://www.cs.columbia.edu/~sable/research/readme.html. 

Using our interface, fourteen volunteers labeled the images under different 
access conditions: by viewing the image alone, the caption alone, both the image 
and the caption, or just the first sentence of the caption. Each image received 
two such labels under the full access condition (when volunteers viewed both the 
image and caption), which we consider representative of normal use of the images 
in multimedia documents.^ We use the labels obtained for this condition as the 
basis for both our training and testing sets. A single label was obtained for each 
image under the other conditions; these are used to estimate human performance 
and to compare with our system (which uses only text information). 

For the indoor versus outdoor distinction, analysis of the assigned labels re- 
veals that in most cases (87.7%), a definite indoor or outdoor judgement was 
made, and only 3% of labels assigned were “Ambiguous”. Agreement between 
humans was also high (90.4% of the images had compatible labels, although 
sometimes with different degrees of confidence). There was, however, some dis- 
agreement between human categorizers. 137 images had labels that differed by 
more than one step on the scale from “definite indoor” to “definite outdoor”, and 
39 of them had in fact one “definite indoor” and one “definite outdoor” label. 
Our analysis of the labels for the number of people feature indicated a somewhat 
lower but still significant level of agreement (80.4%). Inspection of the images 
that received conflicting labels reveals that several of the disagreements are due 

^ Four volunteers labeled the images under this specific access condition. One provided 
labels for all images, while each of the other three provided a second label for a third 
of the images. Thus, each of the 1,675 images received exactly two labels under this 
condition. 
Fig. 1. An image that is hard to classify as indoor or outdoor. 



to mistakes by the categorizers, but in some cases, even markedly different la- 
bels can be attributed to different opinions about how terms like “indoor” and 
“outdoor” should be defined. For example, close-ups of people within a vehicle 
such as a car or a plane, or pictures of people under a roof of a structure with 
no walls, were often labeled differently by different judges. Fig. 1 shows one of 
the images that reasonable people could disagree on; more can be inspected at 
http://www.cs.columbia.edu/~sable/unusual.html. 

We have compiled four different sets of images according to these manual 
categorizations. First, we consider the images for which both evaluators provide 
a definite judgement in the same direction on the indoor versus outdoor question. 
This set contains 1,339 images (79.9% of the original 1,675) and is the primary 
focus of our experiments. 401 (29.9%) of these images were classified as indoor 
while 938 (70.1%) were classified as outdoor. 

Our second experimental data set relaxes the requirement of strong beliefs 
from each evaluator. It consists of those images that received two judgements 
in the same direction on the indoor versus outdoor question, regardless of the 
reported degrees of confidence. This set includes 1,501 images (89.6% of the 
original 1,675). 475 (31.6%) of them are classified as indoor while 1,026 (68.4%) 
are classified as outdoor. 

Turning to the number of people question, we define a third set, consisting 
of the images that received identical (non-ambiguous) judgements from both 
evaluators on that question. This set includes 1,346 images (80.4% of the total), 
further divided as 88 (6.5%) with no people, 304 (22.6%) with one person, 213 
(15.8%) with two people, 609 (45.2%) with three or more people, and 132 (9.8%) 
with crowds. We also define a fourth experimental set for studying the interaction 
between the indoor/outdoor and number of people categories, as the intersection 
of the first and third sets described above. This last set contains 1,081 images 
(64.5% of the total). 
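
The construction of the first two sets can be sketched as a filter over pairs of labels. This is only an illustration: the label strings follow the choices listed earlier in this section, but the function names and the tiny sample are hypothetical, and treating "Ambiguous" as having no direction is an assumption.

```python
DEFINITE = {"Indoor", "Outdoor"}

def direction(label):
    # Map a label to its side of the dichotomy; "Ambiguous" has none (assumption)
    if label.endswith("Indoor"):
        return "indoor"
    if label.endswith("Outdoor"):
        return "outdoor"
    return None

def in_first_set(a, b):
    # Both evaluators gave a definite judgement in the same direction
    return a in DEFINITE and a == b

def in_second_set(a, b):
    # Same direction, regardless of the reported degree of confidence
    return direction(a) is not None and direction(a) == direction(b)

# Hypothetical label pairs, one pair per image
pairs = [("Indoor", "Indoor"), ("Outdoor", "Likely Outdoor"),
         ("Indoor", "Outdoor"), ("Ambiguous", "Likely Indoor")]
first = [p for p in pairs if in_first_set(*p)]
second = [p for p in pairs if in_second_set(*p)]
```

By construction every pair in the first set also belongs to the second, which matches the reported sizes (1,339 versus 1,501 images). The reported proportions are also internally consistent: 401 + 938 = 1,339, and 1,339 / 1,675 is approximately 79.9%.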

4 Measuring Similarity for Indoor/Outdoor Classification 



We base our classification of images into indoor or outdoor classes on a measure 
of similarity between each document we examine and the two category prototypes 
that correspond to the two classes. The term document is used above with a 
general sense, standing for any piece of text that is associated with the image 
under consideration; in many of our experimental runs, this is much smaller than 
the entire article that contains the image. 

For a single piece of text, a word’s TF, or term frequency, is the number 
of times that this word occurs in that text. For a category (such as all indoor 
images), the TF assigned to a word is the number of times that word occurs in 
all documents of that category. A word’s IDF, or inverse document frequency, is 
the logarithm of the ratio of the total number of documents to the number of 
documents that contain that word; this measure remains constant independently 
of the particular document or category examined. The product TF*IDF,

\[
\mathrm{TFIDF}(word) = \mathrm{TF}(word) \times \log \frac{\text{Total number of documents}}{\mathrm{DF}(word)}
\tag{1}
\]



is therefore highest when a word contains a balance of high frequency within a 
document or category (signifying high importance to the document or category) 
and low overall dispersion within the collection (signifying high specificity). 

Every document and category is represented by a vector of TF*IDF values, 
with each dimension corresponding to a word. By abstracting content in this 
manner, word vectors of documents and categories can be compared to determine 
how well a document fits in each category. We use the inner product between 
document and category vectors, i.e.,

\[
\mathrm{Score}(document, category) = \sum_i \mathrm{TFIDF}_{document}[i] \times \mathrm{TFIDF}_{category}[i]
\tag{2}
\]



as our measure of similarity. Each document is then assigned to the category for 
which the fit is best, i.e., for which (2) is maximized. 
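As an illustration of (1) and (2), the following minimal Python sketch builds TF*IDF vectors and assigns a document to the category with the larger inner product. It is not the original implementation: the function names and the four toy captions are hypothetical, tokenization is assumed to have been done already, and words never seen in the collection are simply skipped.

```python
import math
from collections import Counter

def idf_table(documents):
    """IDF(word) = log(total documents / documents containing the word)."""
    n_docs = len(documents)
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))          # count each word once per document
    return {word: math.log(n_docs / count) for word, count in df.items()}

def tfidf_vector(tokens, idf):
    """Sparse TF*IDF vector as in (1); words without an IDF value are skipped."""
    tf = Counter(tokens)
    return {word: tf[word] * idf[word] for word in tf if word in idf}

def score(doc_vec, cat_vec):
    """Inner product of two sparse vectors, as in (2)."""
    return sum(value * cat_vec.get(word, 0.0) for word, value in doc_vec.items())

def classify(tokens, category_vectors, idf):
    """Pick the category whose vector gives the highest score."""
    doc_vec = tfidf_vector(tokens, idf)
    return max(category_vectors, key=lambda c: score(doc_vec, category_vectors[c]))

# Invented toy training data (not from the paper's corpus).
training = [
    (["leaders", "meet", "at", "conference"], "indoor"),
    (["ministers", "at", "press", "conference"], "indoor"),
    (["soldiers", "patrol", "street"], "outdoor"),
    (["police", "at", "street", "demonstration"], "outdoor"),
]
idf = idf_table([tokens for tokens, _ in training])

# A category's TF counts occurrences over all documents of that category.
category_tokens = {}
for tokens, label in training:
    category_tokens.setdefault(label, []).extend(tokens)
category_vectors = {c: tfidf_vector(t, idf) for c, t in category_tokens.items()}

print(classify(["soldiers", "on", "street"], category_vectors, idf))
print(classify(["press", "conference"], category_vectors, idf))
```

Pooling the tokens of all documents in a category before counting mirrors the definition of category TF given above; IDF is computed once over the whole collection and shared by documents and categories.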

We varied this measure of similarity in different experimental runs by using 
different restrictions on what enters the TF*IDF formula (i.e., what a “word” is) 
and by modifying (2) with the introduction of normalizing factors. Our first set 
of parameters, corresponding to the definition of words, involves four choices: 

— Text span considered. What is the text that should be associated with 
each image, becoming the “document” in the TF*IDF calculations above? 
We have experimented by using the entire article, the article without the im- 
age caption, just the caption, or only the first sentence of the caption. While 
the articles are longer and provide more information about the related story 
than the caption, they are less related to the specific image, and therefore 
may contain too much noise to be helpful for the type of categorization we are 
performing. Hence, we can trade some information of questionable quality 
for increased specificity by limiting ourselves to the caption only. Similarly, 
the first sentence of the caption tends to be more descriptive of the image 
than the rest, which often provides background information. 

— Restriction to specific grammatical categories. Should all the words
in the selected text span be included in the TF*IDF computations? Open- 
class words (adjectives, nouns, verbs, and adverbs) carry in general most of 
the content information, while words such as numbers and pronouns do not 
usually affect an image’s classification. We used a statistical part-of-speech 
(POS) tagger (Church 1988) to automatically assign a grammatical category 
tag to each word, and then experimented with using all words, only open- 
class words and prepositions (because of the nature of the indoor/outdoor 
distinction), and open-class words and prepositions with proper nouns ex- 
cluded. 

— Disambiguation of words. A word’s sense is frequently ambiguous, and 
sometimes knowing its grammatical part-of-speech can help disambiguate 
it. For example, can is most often an auxiliary verb, but sometimes a noun 
with a different meaning. We experimented with keeping the POS tag as 
part of the word (thus distinguishing between the two senses of can/verb
and can/noun above), versus ignoring this information. 

— Case sensitivity. Should capitalization matter for treating words as differ- 
ent? Capitalization may indicate a proper noun, but may also be the result of 
sentence-initial placement. We experimented with collapsing words that dif- 
fer only in capitalization to the same token versus treating words as different 
if they differ in case. 

Each combination of the above parameters results in a different set of TF*IDF 
vectors for each document. Three more parameters were varied when calculating 
the similarity between a document and a category: 

— Ignoring words with low TF*IDF during similarity computations. We
have experimented with optionally ignoring words whose TF*IDF values
within a document fall below a given constant, for several alternative
values of that constant. This eliminates relatively insignificant words, which
have minimal impact on the classification, while potentially speeding up the
necessary calculations and avoiding some rare words whose TF and IDF are
hard to estimate accurately.

— Normalization of category vectors. The size of each of the two classes 
does not enter (2) or the TF*IDF calculations. Yet, it is natural to expect 
that the a priori most frequent category will have higher TF values, simply 
because it contains more documents. This is a concern for our experiments, 
since the “outdoor” category contains more than two thirds of the images 
in our collection. We therefore experiment with a modification to (2), where 
the TF*IDF value of each word in a category vector is divided by the total 
number of documents that fall into that category. This modification, which 
replaces total frequency with average per document frequency, makes the 
TF*IDF values directly comparable across categories. 

— Density estimation. The standard approach for assigning documents to 
categories is to select the category for which similarity is largest. This, however,
implicitly assumes that the similarity scores are on the same scale for
both categories, and makes it hard to tell when a difference between the
similarity scores for the two categories is large enough for the system to be
confident in its decision. We experimented with a modification of the category
decision rule by transforming the difference of the raw similarity scores
between the two categories into the corresponding probability that a document
with the given score difference belongs in the indoor category. In other
words, we empirically estimate the probability density of the composite random
variable Score(document, indoor) − Score(document, outdoor). We calculate
the histogram of this difference function from the training part of the
data (see the next section), and then use a rectangular smoothing window on 
top of the histogram to estimate the probability density (Scott 1992). For a 
new image in the test set, we again compute the difference and apply the con- 
version procedure that was fixed during training. The resulting probability 
is more directly interpretable than the difference of the raw similarity scores, 
automatically adjusts the cut-off point between the two categories (from the 
arbitrary 0 on the unrestricted difference scale to the now well-justified 0.5 
on the 0 to 1 probability scale), and provides a measure of confidence in the 
system’s decision (values near 0 or 1 indicate higher confidence) that can be 
easily combined with information from other independent categorizers. 
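The density estimation step can be sketched as follows. This is one plausible reading of the procedure, not the original code: score differences from the training data are binned into a histogram, a rectangular (unweighted) window sums neighbouring bins, and the indoor probability for a new difference is the smoothed share of indoor training images around it. The bin width, the window size, and all names are assumptions chosen for illustration.

```python
def fit_converter(differences, is_indoor, bin_width=1.0, half_window=1):
    """Return a function mapping a score difference to an estimated P(indoor).

    differences: Score(doc, indoor) - Score(doc, outdoor) for training images.
    is_indoor:   matching booleans giving the true class of each image.
    """
    lo = min(differences)
    n_bins = int((max(differences) - lo) / bin_width) + 1
    indoor = [0] * n_bins
    total = [0] * n_bins
    for d, flag in zip(differences, is_indoor):
        b = int((d - lo) / bin_width)
        total[b] += 1
        indoor[b] += 1 if flag else 0

    def smooth(counts, b):
        # Rectangular window: plain sum over the bin and its neighbours.
        start = max(0, b - half_window)
        stop = min(n_bins, b + half_window + 1)
        return sum(counts[start:stop])

    def convert(d):
        # Differences outside the training range fall into the nearest bin.
        b = min(max(int((d - lo) / bin_width), 0), n_bins - 1)
        seen = smooth(total, b)
        return smooth(indoor, b) / seen if seen else 0.5

    return convert

# Invented training differences and labels.
differences = [-3.0, -2.5, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 2.5, 3.0]
is_indoor = [False, False, False, False, True, False, True, True, True, True]
convert = fit_converter(differences, is_indoor)

for d in (-3.0, 0.0, 3.0):
    p = convert(d)
    print(d, round(p, 2), "indoor" if p > 0.5 else "outdoor")
```

The conversion is fixed once on the training data and then reused unchanged on test images, which matches the description above; an image is labeled indoor when the converted probability exceeds 0.5.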



5 Results and Evaluation 

We randomly selected 894 (approximately two thirds) of the 1,339 images that 
had definite human agreement on the indoor versus outdoor classification ques- 
tion for training, and the remaining 445 images for testing. 276 (30.9%) of the 
training images were indoor while 618 (69.1%) were outdoor. 125 (28.1%) of the 
testing images were indoor while 320 (71.9%) were outdoor. So, on that partic- 
ular breakdown of our main experimental image set, a default classifier would 
achieve 71.9% accuracy on the test set by labeling every image with the more 
frequent category in the training set. 

Using this training/testing partition, we calculated the TF*IDF vectors and 
similarity scores described in the previous section for each of the 768 possi- 
ble combinations of parameters, performing a complete designed experiment 
(Hicks 1982). The training set was randomly divided into three equal parts, and 
for each such experiment, we repeatedly trained on two parts and measured sys- 
tem performance on the third. This three-fold cross validation on the training set 
gives us the ability to compare the relative performance of the various settings 
for the experimental parameters. It also allows us to select the best combination 
of parameters, which is fixed for subsequent experiments, and in particular for 
scoring against the completely unseen test set. 
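The figure of 768 combinations follows from the parameter values listed in Sect. 4 and Table 1: 4 text spans × 3 part-of-speech restrictions × 2 tag settings × 2 case settings × 4 thresholds × 2 normalization settings × 2 density settings. The sketch below enumerates that grid and shows the three-fold split in a schematic way. The dictionary keys and helper names are hypothetical, and the items are assumed to have been shuffled beforehand, since the division into three parts was random.

```python
from itertools import product

PARAMETERS = {
    "text_span": [
        "first sentence of caption",
        "caption",
        "article including caption",
        "article excluding caption",
    ],
    "pos_restriction": [
        "all words",
        "open-class plus prepositions",
        "open-class plus prepositions, no proper nouns",
    ],
    "keep_pos_tags": [True, False],
    "case_sensitive": [True, False],
    "tfidf_threshold": ["none", "low", "medium", "high"],
    "normalize_by_category_size": [True, False],
    "density_estimation": [True, False],
}

def parameter_grid(parameters):
    """Every combination of parameter values, one dict per experiment."""
    names = list(parameters)
    return [dict(zip(names, values)) for values in product(*parameters.values())]

def three_fold_splits(items):
    """Yield (train, held_out) pairs so that each third is held out once."""
    folds = [items[i::3] for i in range(3)]
    for k in range(3):
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield train, folds[k]

grid = parameter_grid(PARAMETERS)
print(len(grid))  # 768

# 894 training images give three parts of 298 images each.
for train, held_out in three_fold_splits(list(range(894))):
    print(len(train), len(held_out))
```

Averaging the three held-out accuracies for each of the 768 settings gives the cross-validated score by which the settings are ranked.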

The average accuracy (percentage of correct categorizations) varied widely
with the parameter settings. The parameters with the greatest effect
were:

— Text Span. Restricting analysis to the first sentences of captions accounted 
for the 37 top scoring experiments. First sentences clearly outperformed cap- 
tions, while text spans that included the entire article (with or without the 
caption) were far behind. This provides support to our thesis that specifically
selected and narrowly targeted pieces of text can be more useful for
classifying embedded multimedia information than the document as a whole.

— Restriction to specific grammatical categories. Using only open-class
words plus prepositions accounted for 4 of the top 5 experiments. The average
accuracy over all experiments for this setting was also higher than that for
using all parts of speech, which, in turn, was higher than that using open- 
class words plus prepositions but excluding proper nouns. So it appears that 
proper nouns help in this classification task, a somewhat counter-intuitive 
result, especially since we generally have a high number of low-frequency 
proper nouns. 

— Normalization of category vectors. Normalizing category vectors ac- 
counted for 12 of the top 15 experiments, and had a higher average accuracy 
among all experiments, even more so for cases where density estimates were 
used. 

— Density estimation. Using probability densities instead of raw similarity 
scores improved performance in almost every case, including all combinations 
of parameters ranked near the top. This optional component had one of the 
most pronounced effects in overall system performance. 

On the other hand, ignoring words with low TF*IDF, keeping the part of 
speech information for disambiguation, and ignoring case differences played much 
smaller roles. High thresholds for including words in the TF*IDF vector were 
clearly bad, but other than that, all settings of these parameters were used in some
of the best experiments, and the average accuracy for each was similar. Table 1
summarizes the effect of each value of each parameter over all experiments,
while Table 2 shows the top fifteen combinations of parameters (those which
achieved over 82.5% accuracy) in terms of performance during the three-fold
cross validation on the training set. The average cross-validated accuracy of all
384 experiments that directly use the TF*IDF scores was 71.74%, and of the 384 
experiments that include the probability conversions, 74.26%. Note that these 
overall accuracies are close to the baseline of the default classifier (71.9%), while 
31 of the 768 combinations of parameters performed better than 82% during 
cross validation. This indicates that an informed choice of the parameters is 
important for this classification task. 

On the basis of these cross-validation experiments, we selected the following 
combination of parameters for our system: using the first sentences of captions 
only; restricting words to those of an open class plus prepositions; treating words 
that differ only in part of speech as identical; keeping capitalization information; 
not applying any thresholds for including words in the TF*IDF vector; nor- 
malizing according to category size; and applying the density transformation. 
These are the parameters that were used in the experiment represented by the 
first line of Table 2, which was one of two that tied for the best results during 
cross-validation. With these parameters fixed, we retrained on the full training 
set and tested on the unseen test set. The corresponding categorizer achieved 
Table 1. Average overall accuracy during cross-validation of all experiments with the
given value of each parameter.

Parameter                                  Value                                                  Average Accuracy
Text Span                                  first sentences of captions                            79.45%
                                           captions                                               76.06%
                                           articles (including captions)                          69.22%
                                           articles (excluding captions)                          67.26%
Part of speech restriction                 open-class and prepositions                            73.54%
                                           all words                                              73.09%
                                           open-class and prepositions, excluding proper nouns    72.36%
Keeping tags for disambiguation            yes                                                    73.08%
                                           no                                                     72.91%
Case sensitivity                           yes                                                    73.01%
                                           no                                                     72.99%
Threshold on TF*IDF                        medium                                                 73.63%
                                           low                                                    73.57%
                                           none                                                   73.21%
                                           high                                                   71.57%
Normalization according to category size   yes                                                    73.36%
                                           no                                                     72.64%
Using probability density estimates        yes                                                    74.26%
                                           no                                                     71.74%



on the test set 82.02% accuracy, and 90.72% on the training set.^ If the density
estimate transformation is not employed, the accuracy on the test set falls
dramatically to 72.36%. Tables 3 and 4 are contingency tables further breaking
down these accuracy scores on a per category basis, separately for the cases
where the density adjustments are used or not. Note that the use of probability
densities tends to shift the system’s categorizations from the smaller category
to the larger category. Therefore, the smaller category winds up having a higher
precision and lower recall, while the larger category ends with a lower precision
and higher recall. Detailed results on our 445 individual test images can be
observed at http://www.cs.columbia.edu/~sable/research/demo_results/demo_results.cgi.
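For readers who want to retrace the per-category figures, the sketch below recomputes accuracy, precision, and recall from the cell counts of the two contingency tables (Tables 3 and 4). The counts are those reported in the tables; the function and variable names are hypothetical.

```python
def metrics(table):
    """table[system_label][actual_label] -> count of test images."""
    labels = list(table)
    total = sum(table[s][a] for s in labels for a in labels)
    correct = sum(table[label][label] for label in labels)
    result = {"accuracy": correct / total}
    for label in labels:
        assigned = sum(table[label][a] for a in labels)   # row sum: system said `label`
        actual = sum(table[s][label] for s in labels)     # column sum: truly `label`
        result[label] = {
            "precision": table[label][label] / assigned,
            "recall": table[label][label] / actual,
        }
    return result

# Cell counts from Table 3 (with density estimates) and Table 4 (raw scores).
with_densities = {
    "indoor": {"indoor": 75, "outdoor": 30},
    "outdoor": {"indoor": 50, "outdoor": 290},
}
raw_scores = {
    "indoor": {"indoor": 106, "outdoor": 104},
    "outdoor": {"indoor": 19, "outdoor": 216},
}

m = metrics(with_densities)
print(round(100 * m["accuracy"], 2))                     # 82.02
print(round(100 * m["indoor"]["precision"], 2))          # 71.43
print(round(100 * metrics(raw_scores)["accuracy"], 2))   # 72.36
```

Precision divides the diagonal cell by its row total (everything the system assigned to that category), while recall divides it by its column total (everything that truly belongs to the category).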

Naturally, we want to compare these results with alternative classifiers, in- 
cluding humans. Our accuracy on the test set (82.02%) clearly surpasses that 
of the default classifier which always selects the “outdoor” label for every image 
(71.9%). We estimate human performance on this task by measuring the per- 
centage of correct classifications achieved by a human volunteer who looked only 
at the captions of the images (i.e., who had access to the same kind of informa- 
tion that our system does). Of the 1,339 images in our main set, 1,172 (87.52%) 

^ An indoor output probability of more than 50% is translated to a decision in favor
of the indoor category during this evaluation. 
Table 2. Top fifteen combinations of TF*IDF experiment parameters after three-fold
cross validation on the training set. The “tags” column indicates whether tags were
kept for disambiguating words; the “case” column indicates whether word comparisons
were case sensitive; and the “norm.” column indicates whether the normalization for
category size was applied during the similarity calculations.

Text span                     Part of speech restriction     Tags   Case   Threshold on TF*IDF   Norm.   Accuracy without densities   Accuracy with densities
first sentences of captions   open-class plus prepositions   no     yes    none                  yes     75.06%                       83.22%
first sentences of captions   open-class plus prepositions   no     yes    low                   yes     75.06%                       83.22%
first sentences of captions   all words                      yes    no     medium                yes     78.08%                       82.89%
first sentences of captions   open-class plus prepositions   no     no     low                   yes     74.83%                       82.89%
first sentences of captions   open-class plus prepositions   no     no     none                  yes     74.61%                       82.89%
first sentences of captions   all words                      no     no     medium                yes     79.08%                       82.77%
first sentences of captions   open-class plus prepositions   no     yes    none                  no      78.75%                       82.77%
first sentences of captions   all words                      yes    no     medium                no      78.97%                       82.66%
first sentences of captions   all words                      no     yes    low                   yes     77.29%                       82.66%
first sentences of captions   all words                      no     no     low                   yes     76.73%                       82.66%
first sentences of captions   open-class plus prepositions   yes    no     low                   yes     75.17%                       82.66%
first sentences of captions   open-class plus prepositions   no     yes    medium                no      81.99%                       82.55%
first sentences of captions   all words                      no     yes    none                  yes     77.40%                       82.55%
first sentences of captions   all words                      no     no     none                  yes     77.07%                       82.55%
first sentences of captions   all words                      yes    no     none                  yes     76.96%                       82.55%

were correctly categorized under this access condition.^ This figure can serve as 
a reasonable, approximate upper bound for how well we might hope our system 
to perform given only text information. 

Recently, an image-based approach for classifying photographs as indoor or 
outdoor has been proposed (Szummer and Picard 1998). This approach is based 
on a decomposition of the image by applying a 4 x 4 grid on it and taking 

^ For this purpose, any categorization in the right direction (i.e., indoor or outdoor),
regardless of the degree of confidence, was considered correct while assignments of 
the “Ambiguous” label received half credit. 
Table 3. Contingency table showing the breakdown of the system’s categorizations on
the test set with conversions to probability densities.

                  Actual Indoor   Actual Outdoor   Precision
System Indoor     75              30               71.43%
System Outdoor    50              290              85.29%
Recall            60.00%          90.63%



Table 4. Contingency table showing the breakdown of the system’s categorizations on
the test set using the raw similarity scores.

                  Actual Indoor   Actual Outdoor   Precision
System Indoor     106             104              50.48%
System Outdoor    19              216              91.91%
Recall            84.80%          67.50%





measures of low-level image features such as color and texture on each of the 16 
image regions. Then, similarities between blocks in a given image and blocks in 
known indoor and outdoor images are calculated, and the image is assigned to 
one of the two categories. In cooperation with image processing researchers at 
Columbia,^ we reimplemented this technique and measured its performance on 
our collection of photographs. We found that its accuracy on our test set was 
74%, significantly less than what we obtain with our text-based methods. We 
also added supplemental low-level features, such as edge direction histograms, 
to those used by Szummer and Picard, and a machine learning component for 
estimating classification thresholds. The resulting classifier (Paek et al. 1999) 
achieves 76% accuracy, still less than the method described in this paper.

For each of the above comparisons, we calculated a level of significance by 
applying Pearson’s chi-square test (Fleiss 1981) on the contingency table that 
represents the cross-classification of the answers of the two compared methods.^
We observe that the difference between the performance of our system
and either the default baseline, Szummer’s and Picard’s image-based classifier, 
or (regrettably) the human judges, is strongly significant at the 1% level or less; 
the probability that similar or more pronounced differences in the observed ac- 
curacy rates between the compared methods would be observed by chance is 
0.046%, 0.464%, and 0.460%, respectively. When comparing our system to our 
enhanced image-based model (Paek et al. 1999), the difference is still significant 
at the 5% level (P-value of 3.24%).
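The mechanics of Pearson's chi-square test on a 2 × 2 contingency table can be sketched as follows. The exact layout of the cross-classification tables used for the comparisons above is not given in the text, so the counts below are invented purely to exercise the code, and the function name is hypothetical. With one degree of freedom, the P-value equals erfc(sqrt(χ²/2)), which avoids any dependency outside the Python standard library.

```python
import math

def pearson_chi_square_2x2(table):
    """Pearson chi-square statistic and P-value (1 degree of freedom).

    table: ((a, b), (c, d)) of observed counts, all margins non-zero.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 d.o.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Invented counts, for illustration only.
chi2, p_value = pearson_chi_square_2x2(((10, 20), (30, 40)))
print(round(chi2, 4), round(p_value, 3))
```

A difference is called significant at the 1% (or 5%) level when the P-value falls below 0.01 (or 0.05).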



® Seungyup Paek, Alejandro Jaimes, and Shih-Fu Chang, of the Department of Elec- 
trical Engineering, Columbia University. 

® The large-sample assumption of the chi-square test is satisfied for these contin- 
gency tables. Because we test on several hundreds of images, the exact Fisher test 
(Fisher 1934) is computationally impractical. 
Table 5. System accuracy stratified according to high, medium, or low confidence.

Confidence Level                  Number Correct   Number Incorrect   Accuracy
p > 0.9 or p < 0.1                234              21                 91.76%
0.7 < p < 0.9 or 0.1 < p < 0.3    89               32                 73.55%
0.3 < p < 0.7                     42               27                 60.87%
Total                             365              80                 82.02%



A final evaluation question is how reliable the confidence estimates provided 
by our system’s output probabilities are. Preferably, decisions with a high degree 
of confidence should be more likely to be accurate than decisions given a low 
degree of confidence. We have therefore broken down the test set into three 
subsets according to the probability assigned by our system, p, that a given 
image is indoor. These three ranges of p were defined as high confidence (p > 0.9 
or p < 0.1), medium confidence (0.7 < p < 0.9 or 0.1 < p < 0.3), and low 
confidence (0.3 < p < 0.7). Note that the indoor probability equals 1 minus 
the outdoor probability, with the classifier selecting the indoor category when 
p > 0.5 and the outdoor category otherwise; hence, probabilities of p and 1 − p are
equivalent in terms of the expressed confidence. Table 5 shows the accuracy of our 
system within each confidence category, and verifies that decisions given a higher 
level of confidence are more likely to be correct, thus validating our confidence 
estimates. In particular, 255 (57.3%) of the 445 test images were labeled with 
over 90% confidence, and 91.76% of these categorizations were correct. 
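The stratification behind Table 5 can be sketched as below. Because p and 1 − p express the same confidence, the sketch bins on max(p, 1 − p). How values falling exactly on 0.7 or 0.9 were treated is not stated, so placing them in the lower band is an assumption; the names and the six example predictions are hypothetical.

```python
def confidence_level(p):
    """Map an indoor probability to 'high', 'medium', or 'low' confidence."""
    certainty = max(p, 1.0 - p)
    if certainty > 0.9:
        return "high"
    if certainty > 0.7:
        return "medium"
    return "low"

def accuracy_by_confidence(predictions):
    """predictions: iterable of (p_indoor, actually_indoor) pairs."""
    counts = {}
    for p, actually_indoor in predictions:
        level = confidence_level(p)
        correct = (p > 0.5) == actually_indoor
        right, wrong = counts.get(level, (0, 0))
        counts[level] = (right + int(correct), wrong + int(not correct))
    return {level: right / (right + wrong) for level, (right, wrong) in counts.items()}

# Invented predictions, for illustration only.
predictions = [
    (0.95, True), (0.05, False), (0.97, False),   # high confidence
    (0.80, True),                                 # medium confidence
    (0.60, False), (0.40, False),                 # low confidence
]
print(accuracy_by_confidence(predictions))
```

Well-calibrated confidence estimates should give accuracies that decrease from the high band to the low band, as Table 5 reports for the test set.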



6 Identifying Words with High Discriminating Power 

Methodology. A second approach to the classification problem is to automatically 
locate words (or multi-word phrases) whose presence strongly indicates one of 
the competing classes. We explore this technique by first extracting all open- 
class words plus prepositions from the first sentences of captions. We exclude 
proper nouns from this analysis since they are unlikely to be general indicators 
of one of the categories, and only consider words occurring five times or more 
in our training set. This last step is done to ensure that the words we keep will 
be frequent enough to be general discriminators, and to avoid cases where a 
particular word occurs in a few captions of images from a particular class simply 
by chance. 
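The candidate selection step can be illustrated with a short sketch. It assumes that part-of-speech filtering (open-class words plus prepositions, proper nouns removed) has already been applied to the first sentence of each caption, and it only shows the frequency cut-off and the binary occurrence variables derived from it. All names and the toy captions are hypothetical.

```python
from collections import Counter

def candidate_words(captions, min_count=5):
    """Words occurring at least `min_count` times across the training captions."""
    counts = Counter(word for caption in captions for word in caption)
    return sorted(word for word, n in counts.items() if n >= min_count)

def indicator_row(caption, vocabulary):
    """Binary vector: 1 if the candidate word occurs in the caption, else 0."""
    present = set(caption)
    return [1 if word in present else 0 for word in vocabulary]

# Invented, pre-filtered first sentences of captions.
captions = (
    [["police", "street"]] * 5
    + [["conference"]] * 4
    + [["conference", "hands"]]
)
vocabulary = candidate_words(captions)
print(vocabulary)                                       # ['conference', 'police', 'street']
print(indicator_row(["police", "hands"], vocabulary))   # [0, 1, 0]
```

Here "hands" is dropped because it occurs only once, which is the kind of chance occurrence the frequency threshold is meant to exclude.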

We construct a log-linear regression model (Santner and Duffy 1989) using 
binary variables corresponding to the occurrence of each of these words as predic- 
tors and the output feature (e.g., indoor or outdoor image) as the response. The 
model is fitted with iterative reweighted least squares (Bates and Watts 1988), 
and the fit assigns a weight to each of the candidate discriminators. Words with 
higher weights are those that actually help discriminate between the two classes. 
As an alternative machine learning technique, we also consider decision trees
(Quinlan 1986). The prediction model remains the same, but now the tree is con- 
structed with recursive partitioning, with the most discriminating variable be- 
ing selected first. The resulting tree is shrunk (Hastie and Pregibon 1990) (node 
probabilities are optimally regressed to their parents) to reduce the possibility of 
overfitting; we select the shrinking parameter α through cross-validation within
the training set. 

Results. Using the same training/test set division as with the TF*IDF exper- 
iments reported in the previous section, our list of candidate discriminators 
contains 665 words. Both the log-linear regression model and the tree select a 
subset of these words as classification features; in the case of the selected tree, 
80 words are used during classification. 

It is interesting to note which these words are, especially since the results of 
this procedure are likely to generalize to other sets of images. The five words 
most favoring an indoor classification are conference, meeting, meets, hands 
(plural noun), and L, while the five words most strongly indicating an outdoor 
image are of, from, soldiers, police, and demonstration. Some of them are 
expected (e.g., demonstration or police for an outdoor image, or conference for 
an indoor one), but some come as a surprise, for example, the “words” C, L, and 
R (indicating an indoor image) used in parentheses to identify people in images 
by position (i.e., center, left, or right). 

Overall performance of the word discriminant method was 93.62% over the 
training set and 78.65% over the test set. 

Integrating the two classifiers. The two classifiers discussed in the present and 
the previous section utilize different approaches to arrive at similar classification 
performance. Hence, it is natural to investigate how correlated their answers 
are, and whether a combined classifier might improve overall performance on 
the indoor/outdoor classification task. 

We have built such combined classifiers using both general machine learning 
techniques discussed above (log-linear models and decision trees). However, the 
overall performance of the composite classifiers was in both cases slightly less 
than that obtained by the best individual classifier (82.02%). We attribute this to 
overtraining during the construction of the composite classifiers, especially since 
the same training set was used for building each of them and for combining 
them.^ Nevertheless, our implementation of two classification methods provides 
us with two different general tools that can be easily ported to other high-level 
classification tasks; and the ability to identify key discriminating words may 
prove helpful in future exploration of what makes images in distinct categories 
different. 



^ A further subdivision of our image data in two separate training sets and a test set 
would leave us with too few images in each set. 
7 An Alternative Evaluation Metric

So far, all reported accuracies considered the system to be completely right 
if the category with the higher probability was correct and completely wrong 
otherwise. An alternative evaluation method is to take the probability assigned 
by the system to the correct category and consider that to be the system’s 
score for that document, in a manner similar to the partial credits proposed in 
(Hatzivassiloglou and McKeown 1993). For example, let’s say that the system 
analyzes an image and says the probability that the image is indoor is 65% 
(meaning that the probability that the image is outdoor is 35%). If the image 
is actually indoor, the system is given a score of 0.65 for this image, while if 
the image is actually outdoor, the system is given a score of 0.35. The overall 
accuracy of the system is then the sum of the system’s scores for all images 
divided by the total number of images. In the ideal case, the system would assign 
all indoor images a probability of 1 of being indoor, and all outdoor images a 
probability of 0 of being indoor. Thus its total overall accuracy would be 1, or 
100%. Indeed, if the system always has complete confidence in its decisions, the 
revised evaluation method becomes equivalent to the standard one. 
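The two scoring schemes can be compared with a few lines of Python. The sketch below is illustrative only; the function names are hypothetical, and each prediction is represented as the pair (indoor probability, true class).

```python
def hard_accuracy(predictions):
    """Standard 0/1 scoring: full credit when the more probable class is right."""
    return sum((p > 0.5) == indoor for p, indoor in predictions) / len(predictions)

def soft_accuracy(predictions):
    """Partial credit: the probability the system assigned to the correct class."""
    return sum(p if indoor else 1.0 - p for p, indoor in predictions) / len(predictions)

# The example from the text: the system reports a 65% indoor probability.
print(soft_accuracy([(0.65, True)]))    # 0.65 if the image is actually indoor
print(soft_accuracy([(0.65, False)]))   # about 0.35 if it is actually outdoor

# With complete confidence the two metrics coincide.
certain = [(1.0, True), (0.0, False)]
print(hard_accuracy(certain), soft_accuracy(certain))   # 1.0 1.0
```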

In this way, the system receives partial credit for each answer, more if the 
system leans in the correct direction and directly increasing as the system’s 
confidence in a correct decision increases. In general, when a system already 
classifies most images correctly under the original 0/1 scoring method, it will 
tend to be penalized for its uncertainty on correct decisions more than it is 
credited for uncertain wrong answers. This is the case in our task when our 
classifier is evaluated on our main set of images (those with definite agreement by 
the human volunteers); the system achieves 82.02% accuracy under the original 
evaluation method, and 76.71% under the revised one. However, we consider this 
modified method as more revealing, as it offers a way to evaluate the system’s 
confidence in its decisions. 

To further illustrate this alternative evaluation technique, and also the gen- 
erality of our parameter selection mechanism, we repeated our training of the 
indoor/outdoor classifier on our second set of images, those that had any kind 
of agreement from the human judges (not necessarily with strong beliefs; see 
Sect. 3). We randomly selected 1,000 (approximately two thirds) of the 1,501 
images in that set for training and the remaining 501 images for testing, and 
retrained the classifier using the optimal combination of parameters determined 
in Sect. 5. 308 (30.8%) of the training images were indoor while 692 (69.2%) were 
outdoor; within the test set, 167 (33.3%) images were indoor while 334 (66.7%) 
were outdoor. Our system achieved on the test set 77.05% accuracy using the 
raw TF*IDF similarities and 80.04% after converting those to probability estimates.
The latter of these results is the most important, and it is 1.98 percentage points lower
than the result from the main set with definite agreement. This makes sense, 
since manual categorizations with a lower degree of confidence are less likely to 
be accurate, and also may indicate images that are inherently harder to classify. 
This is in fact reflected in the system’s confidence measure, which tends to be 
lower on these problematic cases; applying the alternative evaluation method to 
Table 6. Breakdown of the set of images with definite agreement on indoor/outdoor
and number of people features into indoor and outdoor images for each value of the
number of people feature.

Number of people       Indoor images   Outdoor images   Percentage of indoor images
No People              2               75               2.6%
One Person             122             108              53.0%
Two People             75              85               46.9%
Three or More People   155             332              31.8%
Crowd                  8               119              6.3%
Total                  362             719              33.5%



this second test set, we obtain overall accuracy of 76.56%, almost as high as that 
measure is for the first test set. 

8 Using Information about Number of People 

Earlier on, we noted that our goal in this line of research is to develop multiple 
classifiers for a number of broadly applicable classification features. It is natural 
to consider interactions between such classifiers, as information about one feature 
may well help the categorization according to another feature. In this section, 
we report on investigations regarding the effect knowledge about the number 
of people in a photograph has on our ability to classify the image as indoor or 
outdoor. 

We have not yet built a text-based classifier for this second feature,^ so we
use instead ideal knowledge, provided by the human categorization of images 
according to this feature. We analyze the set of images that has both definite 
agreement between the human judges in the indoor/outdoor question and agree- 
ment in the number of people question (excluding ambiguous labels). This set 
contains 1,081 images, 362 (33.5%) of which are indoor and 719 (66.5%) are 
outdoor, a distribution similar to that of the larger set which we used for our main
experiments. However, if we take the number of people as given, the distribution 
of indoor versus outdoor images within each category of the secondary feature 
changes, sometimes dramatically, as Table 6 shows. 

To utilize this information, we need a formula that connects f(I|c,d), the
probability density of an image being indoor given that it belongs to category
c according to the number of people feature and that it receives a similarity
difference of d, to our old probability density estimates, f(I|d). Unfortunately,
a Bayesian expansion of f(I|c,d) involves the joint density f(c,d), which we
cannot estimate without a classifier that predicts the number of people c from
the difference d (or vice versa). Therefore, we intuitively derive an approximate



^ Although work is under way for building one based on face detection combined with
name extraction from captions. 
formula for f(I|c,d) as follows: Given N images with similarity difference in a
small neighborhood Δd around d, approximately P(I|Δd) · N of them will be
indoor. Now, for any image that has a specific number of people c, its odds
for being indoor will change (for better or worse) from the global proportion of
indoor images P(I) by the ratio P(I|c) / P(I). If P(c) is the global proportion
of images with c people in them, the overall number of indoor images with c
people among the initial N images with similarity difference close to d can be
estimated as

$$
N(I \mid c, \Delta d) \approx P(c) \cdot \frac{P(I \mid c)}{P(I)} \cdot P(I \mid \Delta d) \cdot N \tag{3}
$$

Similarly, the overall number of outdoor images with c people among the 
same N images can be estimated as 

$$
N(O \mid c, \Delta d) \approx P(c) \cdot \frac{1 - P(I \mid c)}{1 - P(I)} \cdot (1 - P(I \mid \Delta d)) \cdot N \tag{4}
$$



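By combining (3) and (4), we get our formula for updating P(I|Δd):

$$
P(I \mid c, \Delta d)
= \frac{N(I \mid c, \Delta d)}{N(I \mid c, \Delta d) + N(O \mid c, \Delta d)}
= \frac{\frac{P(I \mid c)}{P(I)} \cdot P(I \mid \Delta d)}
       {\frac{P(I \mid c)}{P(I)} \cdot P(I \mid \Delta d) + \frac{1 - P(I \mid c)}{1 - P(I)} \cdot (1 - P(I \mid \Delta d))}
\tag{5}
$$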
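The update in (5) can be illustrated with a minimal Python sketch. The function and argument names are ours, not part of the system described here. The sketch assumes that the outdoor ratio takes the symmetric form (1 - P(I|c)) / (1 - P(I)), which follows from the derivation for a dichotomous classification.

```python
def update_indoor_probability(p_indoor_given_d, p_indoor_given_c, p_indoor):
    """Return P(I | c, delta d) from P(I | delta d), P(I | c) and P(I).

    Illustrative sketch only: the outdoor ratio is assumed to be
    (1 - P(I|c)) / (1 - P(I)), i.e. P(O|c) / P(O).
    """
    if not 0.0 < p_indoor < 1.0:
        raise ValueError("P(I) must lie strictly between 0 and 1")

    # How much knowing c shifts the indoor and outdoor proportions.
    indoor_ratio = p_indoor_given_c / p_indoor
    outdoor_ratio = (1.0 - p_indoor_given_c) / (1.0 - p_indoor)

    # Relative numbers of indoor and outdoor images; P(c) and N cancel.
    indoor_mass = indoor_ratio * p_indoor_given_d
    outdoor_mass = outdoor_ratio * (1.0 - p_indoor_given_d)
    if indoor_mass + outdoor_mass == 0.0:
        raise ValueError("the two sources of evidence are contradictory")
    return indoor_mass / (indoor_mass + outdoor_mass)
```

When P(I|c) equals P(I), the number of people carries no information and the estimate P(I|Δd) is returned unchanged; when indoor images are over-represented in category c, the estimate is pushed towards indoor.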



We applied this update formula to the images in the set with definite agree- 
ment on both the indoor/outdoor and number of people questions. Since that 
set is a subset of our main experimental image set, we took those images that 
were in the training set for the main set (see Sect. 5) as our training images, 
and the remaining as test images. The resulting training set had 732 images, 
of which 249 (34.0%) were indoor, and the testing set contained 349 images, of 
which 113 (32.4%) were indoor. If the methods of Sect. 5 are applied to this 
training/test set partition while ignoring the number of people information, we 
obtain 79.94% accuracy on the test set. If instead we assume perfect knowledge 
of the number of people variable and update the probability estimates by
applying (5) (estimating quantities such as P(I) and P(I|c) from the training
set), we obtain 80.23% accuracy on the test set. This is only a minor improve- 
ment, not statistically significant. However, if the alternative evaluation metric 
of the previous section is employed, accuracy improves from 74.96% to 77.19%. 
So while few categorizations actually changed from wrong to right or vice versa, 
the system’s confidence values in its decisions were more appropriate when the 
number of people was taken into account. In other words, on average, correct 
decisions were given higher confidence while the reverse happened to incorrect 
decisions. 

9 Conclusions and Future Work

We have shown that our methods for categorization of images as indoor or out- 
door strongly beat baseline performance and competing, image-based techniques, 
and even begin to approach human performance. In fact, our system provides 
93.72% of the correct answers that a human judge with access to the same kind 
of information does (82.02% versus 87.52% overall accuracy). By staying within 
the TF*IDF paradigm but experimenting with several parameters and adding 
the use of probability density estimates, we have created a system that achieves 
82% accuracy on unseen images. The output of our system is in terms of a 
probability, which is readily interpretable and provides a level of confidence in 
the system’s decision. We have explored additional techniques both for image 
classification and for evaluating the constructed classifiers. In addition, we inves- 
tigated the possibility of using additional information about images that might 
change a priori probabilities of an image being indoor or outdoor, and there is 
some promise that the system’s results may be improved. Our methods are gen- 
eral, and could be applied to other high-level visual features, although currently 
our model of probability densities assumes dichotomous classifications. 

We have examined a classification approach that relates to the Rocchio 
paradigm (Rocchio 1971) and combines TF*IDF estimates with a probabilistic 
normalization. A future alternative is to compare our results with pure proba- 
bilistic approaches such as naive Bayes (Lewis 1998) and connectionist models 
(Lewis et al. 1996). Certainly, we have not exhausted the space of possible fea- 
tures and transformations of the input data; we plan to examine additional such 
options, including morphological transformations/stemming, semantic
information linking related words, and different weighting of identified named entities.

Our immediate next step is to integrate this text-based classifier with image- 
based ones that are being developed at Columbia, and expand the range of 
classification questions considered. We will explore high-level classifications such 
as indoor/outdoor, number of people, and city versus landscape, and comple- 
ment the general classifiers with specific image feature detectors (e.g., detectors 
of skies, handshakes, or faces). Our goal is to provide a hierarchy of such
classifiers and analyze their interactions so that we can build a model that relates a
combination of the high-level visual features to specific conditions under which 
an image is appropriate for inclusion in a multimedia document. 

Abstract. This paper describes the production of an educational
multimedia CD-ROM about French rural houses and farms, and how to
renovate them without losing their traditional features. The educational
message is illustrated with many photographs of non-renovated or
renovated houses, and made explicit through comments and descriptions
associated with the photos. The paper focuses on the XML metadata
describing the photos and the use of this metadata for the automatic
generation of Web pages. We first report on the usability of the Dublin
Core for interoperable photograph metadata, together with more
detailed XML descriptions to support a specific multimedia application.
We then show how to generate the Web pages by defining HTML
document prescriptions which embed queries to the XML metadata, using
Norfolk, a virtual document generator. The approach can be used in
various applications ranging from personal virtual photo albums to complex
virtual museums.



1 Introduction 

Digital Libraries provide access to an increasing variety of digital multimedia 
information, including images, photos, sound and video. For many years Libraries 
and Digital Libraries have been relying on classification and indexing schemas 
for supporting retrieval of bibliographic references and documents, while Web 
search engines use a mix of full-text indexing and classification. Searching for
non-textual material like photos, video and sounds^ requires indexing through
related textual descriptions, or metadata, generally held in an external document
but possibly encoded within multimedia formats such as JPEG. Metadata, or
information about data, have motivated a lot of international interest and effort 

^ Searching by similarity with a given photo can be done using internal low-level
characteristics of the photo. We do not address this kind of search here.
in the last three years, through the Dublin Core Workshop series [5], or the
Resource Description Framework (RDF) defined by the W3C. 

The Dublin Core has attracted the attention of formal resource description
communities such as museums, libraries, government agencies, and commercial
organisations, as illustrated by the many projects listed on the DC site.
However, there are very few reports on the use of DC metadata for photographs.
Before the availability of digital photographs, some effort was spent on
cataloguing photographs, although the available cataloguing tools (AAT Art and
Architecture Thesaurus, LC thesaurus for Graphic Material) are not very specific
to photographic material, and there are no appropriate standards. As speculated
in [9], archivists for photos and graphic collections may have considered their
collections as unique and assumed that no imposed standard could work on their
collection. This argues once more for an infrastructure such as RDF that
supports the coexistence of complementary, independently maintained metadata
packages.

Metadata have been primarily designed to support the retrieval and display 
of resources. Digital Libraries of the future will not just display raw material 
stored in their databases, nor will they deliver it in a one-for-all presentation. 
As claimed in [13], "Searching is not enough". Information will be accessed
through user queries, browsing, guided tours, temporal exhibitions, educational 
courses, etc. The same resource will be used in different contexts, associated with 
other relevant material. Photographs, for example, will be pulled out with other 
documents which are associated with them either statically or dynamically. All 
this information needs to be ready for use during frequent context shifts. The 
material collected for a particular task needs to be pulled together and kept for 
near or long-term use in new information artefacts. Users as well as librarians and 
curators need facilities to construct information compounds or to build flexible 
presentations of Digital Libraries resources. 

This paper presents an example of a simple, yet significant, multimedia ap- 
plication (ultimately a CD-ROM) that is automatically generated from XML 
Dublin Core metadata describing photos. The DC Description element has been 
enriched with a specific schema to support the target application. A more
complex application may require separating Dublin Core metadata from other XML
descriptions and complex documents, but the principle would be the same: the
application is generated from page prescriptions which define the page model
and the queries to retrieve the appropriate elements (photos, texts) from the
XML metadata. Links to other pages can also be modelled, and result in a
hypertext which is coherent and easy to modify.

The rest of the paper is organised as follows. First we present our application: 
a CD-ROM about rural housing. We then discuss two problems associated with 
the production of this CD-ROM: how to express metadata about photographs 
using Dublin Core, and how to automate the generation of pages with virtual 
documents using the Norfolk system. The last section outlines some benefits
of the approach, draws some conclusions and identifies further work.

2 The application

We set out to produce a CD-ROM on rural habitations in a small area near Paris,
France, where renovation and modern housing have been developing rapidly in
the last thirty years, threatening to ruin completely the traditional aspect of
French villages. The aim of the CD-ROM is to make people more aware of
this rural cultural heritage and to provide guidance for renovating houses while
preserving their traditional features.

About 300 photographs of the various villages were taken in the area to
illustrate and support the discourse. They were assembled into a photo
exhibition which toured several villages during 1992-1993. Following the success
of this exhibition we thought it was worthwhile to make this material available
on a more permanent and accessible medium such as a CD-ROM. Like the
exhibition, the multimedia presentation will use mostly photos to support the
educational message through illustrative and comparative examples. About 150
photos have already been scanned for use in the multimedia presentation. The
main advantage of a hypermedia presentation over a traditional exhibition is
that some photos can be reused several times, in different contexts, to
illustrate different aspects of rural buildings. For example, a photo of a nice
and well renovated house could be used as an example of roof, window or facade
renovation, associated with the relevant and focused comment.

The CD-ROM, as a more durable medium, also needs to provide detailed
information about the buildings, so that the photos may become part of the
village archives. Overall information will include:

— factual information (full address, date of the photograph, state of the build- 
ing, etc.) 

— comments and description about the main interesting features 

— comments and appreciation about the quality of the renovation (this is
somewhat tricky, as litigation must be avoided)

— comparisons of the same building over time (when photographs or postcards 
are available) 

The user can navigate the CD-ROM using a Web browser. Figure 1 shows
the navigational structure. Navigation can be guided by village/street, theme
(roof, windows, etc.), or (not shown in the figure) by examples of renovation
(good ones as well as bad ones, on different aspects) or comparison before and
after renovation (when available). The description page for a photograph can
depend on its context: for example, the description will be different if the user
is looking at "roofs" or "windows".

Examples of photos can be seen in Figure 2.

Fig. 2. Example of photos for house fronts

The digital photos have been processed by a photo laboratory from the colour
negatives and put on a read-only CD-ROM. We initially thought of using the
KODAK Photo-CD standard to keep the best definition of the photos, as
recommended in [8]. A KODAK Photo-CD can store a maximum of 660 MB of photos
(about 100 photos). As we intend to store at least 150 photos, and ultimately
many more, this would have been too costly for the budget of the local
association that is producing the final CD-ROM(s). Since the application was
intended mostly for public education, working images were sufficient for our
purpose. We therefore decided to store digital photos as 24-bit images
compressed in JPEG, with dimensions of 800x537 pixels. This results in image
files ranging in size between 42k and 300k, which would allow us to store
thousands of images on one photo CD-ROM. From these images we produced
thumbnail images (200x134), also called browsing images in [8], by reducing the
images to 25% of their original dimensions, which results in image files of
between 8k and 35k. If smaller images (iconic images) are necessary for an
application, they will be produced at display time by setting the
attributes in the HTML IMG tags.
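The storage figures above can be checked with a short calculation. The following Python sketch is only a back-of-the-envelope estimate: it assumes that "k" means kilobytes, that 1 MB = 1024 KB, and that the whole 660 MB is available for images.

```python
CD_CAPACITY_MB = 660   # capacity quoted in the text
KB_PER_MB = 1024       # assumption: binary megabytes


def images_per_cd(image_size_kb, capacity_mb=CD_CAPACITY_MB):
    """Number of images of a given size (in KB) that fit on one CD."""
    return (capacity_mb * KB_PER_MB) // image_size_kb


def scaled_dimensions(width, height, factor):
    """Pixel dimensions after scaling both sides by the same factor."""
    return int(width * factor), int(height * factor)


largest, smallest = images_per_cd(300), images_per_cd(42)
thumbnail = scaled_dimensions(800, 537, 0.25)
```

Under these assumptions a disc holds between roughly 2,250 full-size images (if all were 300k) and roughly 16,000 (if all were 42k), and scaling 800x537 to one quarter of its dimensions gives the 200x134 thumbnails.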

3 Metadata for photographs 

Although photographs may be more explicit than a long discourse for a
human, they do not describe themselves in terms of content as texts do. For
texts, authors use many clues to indicate what they are talking about: titles,
abstracts, keywords, etc., which may be used for automatic cataloguing.

Searching for photos must rely on manual cataloguing, or on related texts and
documents that come with the photos. Creating standard metadata for
photograph content is difficult since one photo may have very different
meanings, depending on the context, on whether it is part of a collection, or
even on the purpose of a particular end user. Our collection was very specific,
and there was no standard for describing this kind of building. At least the
Dublin Core offers a framework for describing digital resources, and we wanted
to find out how it would work for our photos. Our main motivation for using a
standard like Dublin Core was to make our photos reusable for other experiments
or applications, by ourselves or by other groups.

For example, one could think of creating a CD-ROM by village and comparing
how a house looks today with how it looked in 1900, using the many
available postcards.

3.1 Dublin Core 

The Dublin Core metadata set [5] is intended to promote and develop the meta- 
data elements required to facilitate the discovery of resources (documents and 
images) in a networked environment such as the Internet and support
interoperability amongst heterogeneous metadata systems. The Dublin Core
metadata set consists of fifteen descriptive elements. Commonly understood
semantics for these elements are described in the reference description [7],
which has been largely stable since 1996 and is known as DC 1.0. Although the
first aim of Dublin Core
is to offer a minimal set of core elements to achieve interoperability, extensibility 
has been an important issue since the second DC workshop. Early experience 
with Dublin Core deployment has made clear the need to support qualification of 
elements for some applications. Thus, a Dublin Core element may be expressed 
without qualification or with qualifiers that refine its semantics. Three qualifiers
were defined: sub-elements, language and scheme. A sub-element specifies a facet 
of a given element. The intent is to narrow the semantics of a field, not to extend 
it. The dot-encoding convention used for this qualifier reflects this. The language 
qualifier indicates the language used for the content of the element. Finally a 
scheme qualifier specifies a context for the interpretation of a given element. 
Typically this will be a reference to an externally-defined scheme or accepted 
standard. Dublin Core defines scheme as a qualifier that "provides a processing
hint that may be used by an application or a person to make better use of the
element that is qualified".
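As a small illustration of the dot-encoding convention for sub-elements, the following Python sketch splits a qualified name such as Coverage.location into its element and sub-element. The function name is ours, and the sketch assumes at most one level of qualification.

```python
def split_qualified_name(name):
    """Split a dot-encoded name into (element, sub_element or None)."""
    element, _, sub_element = name.partition(".")
    return element, (sub_element or None)


# A sub-element narrows the element: both names below still belong to
# the Coverage and Title elements for an application that ignores qualifiers.
qualified = split_qualified_name("Coverage.location")
unqualified = split_qualified_name("Title")
```

An application that does not understand a given sub-element can fall back on the first component, which is consistent with the intent that a sub-element narrows the semantics of an element rather than extending it.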

We expressed DC metadata in XML as shown in the example given in Ap- 
pendix 1. A more formal description using RDF and following the guidance on 
expressing Dublin Core within RDF [12] can be read in Appendix 2. In the next 
section we describe in detail our use of Dublin Core and the few qualifiers we 
added (sub-elements and schemes). Our schemes are XML schemas and values 
for the corresponding elements are therefore expressed also in XML. 

3.2 Report on our use of DC 

Although our collection of photographs is rather modest in size, our use of DC
has given rise to a number of small difficulties which are worth reporting. As
mentioned before, very few reports on using DC for images and photographs are
available today.

Below we describe, for every DC element, its use in our application.

Title: What constitutes the title of a photo is not well defined. It could be
seen as the caption of an image, but a caption generally depends on the
context. Think of the kind of legends you may write in your family photo
album: they may be very informative, or just humorous. In our case we put a
very general description that may be used in various contexts and is of
some use for general retrieval.

Author: The person who took the photograph.

Contributor: The contributor could be the lab or person who developed the
photo, or scanned it. Each contributor could have a role specified with a
sub-element qualifier. Here we just mention the photo shop that scanned
the photos.

Subject: This should be a list of keywords in a standard terminology for
buildings and architecture. Since rural habitat has not yet attracted much
interest as cultural heritage, there is no classification schema
specifically relevant to these buildings, so we use common words (e.g.
windows, roof, facade). The Subject element is intended more for
interoperability with other applications, and we do not use it to query
the metadata. Instead we use the sub-elements of the Description element,
since a photo will be retrieved only if there is a specific description
for this subject.

Description: Each photo has two descriptions: a general one usable in the
presentation by village, and a more detailed one using a scheme qualifier
specific to the application. Our schema contains XML elements for the parts
of the house that are being described (roof, roof windows, windows, doors,
coating, front, etc.). It is also used to describe the general state of the
building: renovated, poor condition, etc.

Publisher: The publisher of the CD-ROM. There is a problem here if we want
to make the material reusable: the publisher would depend on the specific
application the photo is used in. In our case it was not clear what to put
in. We chose to mention the association that supports the creation of the
CD-ROM, although the photos (and their metadata) could be reused in another
context.

Date: The date when the photograph was taken (not scanned). It seems that
Dublin Core requires Date to be used for the date of the digital resource,
and possibly Source.date for the date of the original photo. For us this
was typical of the problems that arise with photos. We would rather consider
the photo as the resource, and the various digital images as surrogates for
this photo. We will come back to this point with Format. For our application,
what is important is the date of the original photo, which marks the state
of the building at a given period.

Type: The type of the resource, which is just an "image" in terms of the DC
recommended Type categories. We also mention whether it is a colour or
black and white photo, in a second Type element.

Type.quality: The quality of the photo. There is no specific element in DC
for the intrinsic quality of an image or a photo. Although it may be
somewhat subjective, it is important information when the user or the
application wants to choose between equivalent photos to display.
Putting it in a sub-element certainly does not conform to the DC
requirements for qualifiers, but there was no appropriate place to add this
information using Dublin Core.

Format: The format of the digitised image (MIME/JPG).

Format.orientation: Whether the photo must be seen horizontally or
vertically. This is important information when selecting photos for a
multimedia presentation, since vertical photos can be grouped differently
from horizontal ones to fit a usual screen size.

Format.full, Format.small: These refer to the existence of two surrogate
formats for the photos. Again there was a difficulty here in describing
various formats and sizes for the same photo.

Identifier: A reference to the actual photo. We could not use a URL since the
photos will move to a CD-ROM, with a different hierarchy. We use a URI
that reflects the relative organisation of the application.

Source: A reference to the original photo, or its negative; the value of the
reference will strongly depend on where the original is archived, whether
it is a personal collection or an Archive Department. A specific SCHEME
could be useful here.

Language: Not applicable for photographs.

Relation: Relation to the overall collection or to another photo for comparison.
Relation could in principle be used for reference to surrogates of the same
photo, but we did not like this solution, which would make searching for a
photo a very complex process.

Coverage: The date and location of the object in the photo (i.e. the building).

Coverage.date: The date of the object being photographed. We did not
use this element since we had little information on the construction date of
the buildings.

Coverage.location: The precise address of the building, since the photo
could serve as a future archive for the villages. We introduced an
address scheme (number, street, town, country) to be able to retrieve and
sort the photos by village, street, or house.

Rights: The copyright for the photograph. There is some ambiguity between
the rights for the source and for the digitised photograph. We mean here the
source version, and by extension any digitised version of the photograph.
Copyright may be much more complex.

In summary, the difficulty with metadata for photographs is to make the
distinction between the information about the photo and the information about
what is represented by the photo. Retrieval of the photo will be mostly done
from the latter, while rendering of the photo has nothing to do with its content.
The existence of various "versions" or digital surrogates of the original photo
makes it more confusing, as reflected in the use of the Format element. These
problems have already been identified in [18].
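To make the element usage listed above more concrete, the following Python sketch assembles a small record with the standard xml.etree.ElementTree module. It is not the encoding given in Appendix 1: the photo wrapper element, the namespace URI, the SCHEME value "address" and all field values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the DC prefix (not stated in this section).
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("DC", DC_NS)


def dc(tag):
    """Return the namespace-qualified tag for a DC element name."""
    return f"{{{DC_NS}}}{tag}"


def build_photo_record(title, author, date, identifier, orientation, address):
    """Build one metadata record; 'photo' is a hypothetical wrapper element."""
    record = ET.Element("photo")
    ET.SubElement(record, dc("Title")).text = title
    ET.SubElement(record, dc("Author")).text = author
    ET.SubElement(record, dc("Date")).text = date            # date of the original photo
    ET.SubElement(record, dc("Type")).text = "image"
    ET.SubElement(record, dc("Format.orientation")).text = orientation
    ET.SubElement(record, dc("Identifier")).text = identifier  # relative URI

    # Address scheme (number, street, town, country) under Coverage.location.
    location = ET.SubElement(record, dc("Coverage.location"), {"SCHEME": "address"})
    addr = ET.SubElement(location, "Address")
    for field in ("number", "street", "town", "country"):
        ET.SubElement(addr, field).text = address[field]
    return record
```

Keeping the address as separate sub-elements rather than a single string is what allows the photos to be retrieved and sorted by village, street, or house.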

3.3 Metadata for describing the scanning process 

[2] makes strong requirements on what information should be kept about the
format of a digitised image and about the scanning process. This information is
mostly irrecoverable if not recorded at the time of capture. In large archiving
projects this would be a major requirement, while it is less critical for casual
users or small applications such as a CD-ROM. This information may be important
in order to display long-term archive images many years later, or to create
large photograph workbenches. In our case we did not store any information about
the scanning process other than the name of the laboratory that did the work. We
felt it was not necessary given the type of photos and the precision of the
digitised images.



3.4 Application specific metadata 

As said before, metadata specific to the application were introduced as extra
tags in the various Description elements, using a Scheme qualifier. For example,
the description of a "window" is expressed as follows:

<DC:Description SCHEME="habitat">
  <rural> <window example="good">
    The windows are taller than they are wide (1/3), with 6 panes.
    Note the irregular placement of the windows, with a large
    plain area on the right.
  </window>
  </rural>
</DC:Description>

For more complex applications this information could be stored in a separate
XML document, with a possibly complex DTD. For simplicity, we chose to
store it in repeated Description elements.
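As an illustration of how such application-specific descriptions can be queried, the following Python sketch extracts the comments for a given theme from a record. It is not the query mechanism used in our application: the photo wrapper element, the namespace URI and the sample text are assumptions.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # assumed namespace URI

SAMPLE = """<photo xmlns:DC="http://purl.org/dc/elements/1.1/">
  <DC:Description>General view of a renovated farmhouse.</DC:Description>
  <DC:Description SCHEME="habitat">
    <rural><window example="good">
      The windows are taller than they are wide.
    </window></rural>
  </DC:Description>
</photo>"""


def theme_comments(record, theme, example=None):
    """Return the comments stored for a theme (e.g. 'window') in a record.

    Only Description elements with SCHEME="habitat" are inspected; the
    optional 'example' argument filters on the example attribute.
    """
    comments = []
    for description in record.iter(f"{{{DC_NS}}}Description"):
        if description.get("SCHEME") != "habitat":
            continue
        for node in description.iter(theme):
            if example is None or node.get("example") == example:
                # Collapse the whitespace introduced by the XML layout.
                comments.append(" ".join((node.text or "").split()))
    return comments
```

A photo is selected for a theme only when the returned list is non-empty, which mirrors the rule that a photo is retrieved only if there is a specific description for that subject.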

4 Virtual Documents 

Our CD-ROM will have a large number of pages, even for only 150 photographs 
since each photo may correspond to several presentations depending on the con- 
text where it is displayed. Moreover we intend to scan and add more photos in 
the future and design more personalised versions of the CD-ROM for didactic 
usage in a specific village. Clearly we need a tool to automate the creation of
pages, which would allow photos to be selected and grouped by themes, villages,
etc.

Our solution is to generate pages as virtual documents, i.e. to write HTML
"templates" with normal tags and text and with instructions on how to generate
dynamic information extracted from the photograph collection. In this section 
we introduce Norfolk, our virtual document generator, give some examples 
of document prescriptions to generate our CD-ROM, and describe the set of 
prescriptions. We conclude with a discussion about dynamic generation. 



4.1 The Norfolk system 

Norfolk is a system for the generation of virtual documents. Initially developed
to promote the idea of information reuse, it has been used for Web delivery
(see [14] for how it is used to generate and maintain the TED Web site
at http://www.cmis.csiro.au/TIM/), and more recently for customised
information delivery, i.e. how to select, modify or generate the information to
satisfy a particular user need.

In Norfolk, the instructions to generate a virtual document have a descriptive
rather than a programmatic style, and, more importantly, a common data model
is defined to view information coming from a variety of sources as a
document-like structure. Information sources can still be queried in their
native format (e.g. SQL for a relational database), but results are then
"translated" to a tree structure similar to a document parse tree. A query
language allows selection or modification of the tree, and supports useful
operations for document structures. For example, it supports extraction of
semi-structured information, i.e. information whose exact structure is not
known or can vary. For more information on the Norfolk data model and language,
see [16]. An online demonstration is also available at
http://www.ted.cmis.csiro.au/RIO.

Figure 3 shows how Norfolk is used to generate a set of pages. The instructions
to generate the virtual documents are stored in document prescriptions. The
interpreter takes a document prescription as input, queries data sources such
as databases or other pages, and produces a virtual document as output. The
"children", i.e. the document prescriptions referenced in the virtual document,
are then recursively generated. In other words, the generator "follows" the
links to document prescriptions in order to generate a network of virtual
documents. Cycles and naming for the virtual documents are automatically dealt
with.
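The recursive generation described above can be sketched in a few lines of Python. This is not Norfolk's implementation: the render callback stands in for the interpreter, and for simplicity every href found in a generated page is treated as a link to another prescription.

```python
import re

LINK_PATTERN = re.compile(r'href="([^"]+)"')


def generate(name, render, pages=None):
    """Generate the page for 'name' and, recursively, every page it links to.

    'render' is a hypothetical callback mapping a prescription name to the
    generated HTML. Pages already generated are skipped, so cycles in the
    link structure terminate.
    """
    if pages is None:
        pages = {}
    if name in pages:
        return pages
    pages[name] = render(name)          # record the page before recursing
    for link in LINK_PATTERN.findall(pages[name]):
        generate(link, render, pages)
    return pages
```

Recording a page before following its links is what makes cycles harmless: a link back to a page that is already in the dictionary is simply skipped.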




Fig. 3. Generation of pages using Norfolk 



4.2 Examples of prescriptions 

The following example shows part of a prescription that collects all the
photographs related to one village (passed as a parameter) and displays the list
of those photographs, with the address of the places and links to the actual
photographs.

The document prescription is a valid HTML document, to which Norfolk
instructions are added in the form of processing instructions (i.e. <? ...>).
"Part 1" finds the photographs for the village. It first gets the list of all
metadata files (an index file, DIR.xml, which is dynamically generated, has an
<f> element for every file) and then iterates through the list, loading each
file and checking whether it is in the right village. In practice, for
efficiency reasons, the list of photographs for every village could be built
beforehand, and included in DIR.xml. "Part 2" then outputs the photographs in
a list, with the address and a link to the image for every photograph.
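For readers unfamiliar with prescriptions, the two parts just described correspond roughly to the following Python sketch. It is not Norfolk code: photo records are represented as plain dictionaries whose keys (town, number, street, identifier, source) are hypothetical stand-ins for the metadata fields.

```python
from html import escape


def village_page(village, photos):
    """Return the HTML list of photos for one village."""
    # Part 1: keep the photos whose town contains the village name.
    selected = [photo for photo in photos if village in photo["town"]]

    # Part 2: one list item per photo, with its address and a link to the image.
    items = []
    for photo in selected:
        address = escape(f'{photo["number"]} {photo["street"]}')
        link = escape("../" + photo["identifier"])
        label = escape(photo["source"])
        items.append(f'<li>{address} <A href="{link}">{label}</A></li>')
    return "\n".join([f"<H1>{escape(village)}</H1>", "<ul>", *items, "</ul>"])
```

As noted above, the selection of Part 1 could be computed once per village and stored, rather than repeated each time a page is generated.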

<!-- Takes one parameter: the village name -->
<?param "village" as $village>

<!-- Output title (village name) -->
<H1> <?$village> </H1>

<!-- === PART 1: find photos for village === -->

<!-- Assign all filenames to $photos -->
<?define $photos as url("DIR.xml")..f>

<!-- build $photosVillage (photos for a village)
     by iterating through the list of all photos -->
<?map $xx in $photos>
<?begin>
  <?define $a_photo as url(str("photos-XML/", $xx))>
  <?if $a_photo..town contains $village>
  <?begin>
    <?define $photosVillage as $photosVillage adopt $a_photo>
  <?end>
<?end>

<!-- === PART 2: output list of photos === -->
<ul>
<?map $photo in $photosVillage>
  <li> <?$photo..Address.number; $photo..Address.street>
  <A> <?attributes attr("href", str("../", $photo..DC:IDENTIFIER))>
  <?$photo..DC:SOURCE> </A>
  </li>
</ul>

Fig. 4. Document prescription for a Village

The following prescription produces a list of four villages, with a link to
their respective pages. The href attribute, created with the attr command, is
a link to the village prescription. This link consists of the prescription name,
village.rhtml, and the village name: for example, the urlstr command will
produce village.rhtml?village=Chavenay. The attributes instruction is
then used to attach the href attribute to the preceding tag, <A>.



<html>
<!-- output list of links to each village page -->
<ul>
<?map $x in list("Les Alluets-le-Roi", "Chavenay",
                 "Ecquevilly", "Noisy-Le-Roi")>
  <li> <A>
  <?attributes attr("href", urlstr("village.rhtml", "village", $x))>
  <?$x> </A>
  </li>
</ul>
</html>

Fig. 5. Document prescription for a List of Villages
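The kind of link produced by urlstr can be imitated in Python with the standard urllib.parse module. This is only an analogy: how Norfolk itself escapes special characters such as spaces is not specified here, so the encoding shown for multi-word village names is an assumption.

```python
from urllib.parse import urlencode


def prescription_link(prescription, **parameters):
    """Build a link to a prescription with its parameters as a query string."""
    query = urlencode(parameters)
    return f"{prescription}?{query}" if query else prescription


villages = ["Les Alluets-le-Roi", "Chavenay", "Ecquevilly", "Noisy-Le-Roi"]
links = [prescription_link("village.rhtml", village=name) for name in villages]
```

For Chavenay this yields village.rhtml?village=Chavenay, the form given in the text; with urlencode, the spaces in "Les Alluets-le-Roi" are encoded as plus signs.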



When Norfolk generates the List of Villages page, it follows the Village
prescription links (e.g. village.rhtml?village=Chavenay) to generate the Village
pages as well.

4.3 The set of prescriptions 

— model 1: set of photos organised by village (village.rhtml)

— model 2: a page by (photo, context = village) 

— model 3: set of photos organised by themes; each theme may have its own 
model. 

— model 4: page by (photo, context = theme) (for the relevant photos) 

— model 5: full size photo 

Each model has to be designed carefully with possible variants. For example 
a page according to model 1 could be organised by streets if there are many 
photos for a village, or could display iconic photos in a unique page if there is 
only a small set of photos. Further links to individual photos will lead to pages 
generated according to model 2 or model 4, depending on the context. 

It is worth to notice that model 2 and model 4 could be the same, with a 
parameter setting the context. It may be a quick way of prototyping the pages 
in a preliminary version of the CD-ROM. 

4.4 Dynamic generation versus materialisation 

The automatic generation of Web pages in Web-based information systems can
be done either in a dynamic mode or in a compiled mode. Dynamic generation
of pages occurs when a page is created at the time a user tries to access it.
The pages are kept virtual and never stored in a Web site. Usually the page
is created by a specific program (for example a Common Gateway Interface
(CGI) script querying a database), or from a specific query language whose
statements are embedded into HTML pages [16]. Another approach is to compile
all the pages into a materialised hypertext. Although for Web sites one can
argue with [15] whether dynamic generation is preferable to the creation of
actual pages, for a CD-ROM there is no benefit in keeping the pages dynamic.
The main reason is that once the information is on the CD-ROM there is no way
of updating it, and therefore the generated pages will never change.

Dynamic generation is nevertheless very useful during the design and tuning
of the page prescription, as well as during the process of selecting the
appropriate photos, as we have explained before.
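
The difference between the two modes can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: render is a hypothetical
stand-in for evaluating a page prescription, and serve_dynamic and materialise
are names introduced here, not part of Norfolk.

```python
import tempfile
from pathlib import Path


def render(village):
    """Hypothetical stand-in for evaluating a page prescription."""
    return "<html><h1>{}</h1></html>".format(village)


def serve_dynamic(village):
    """Dynamic mode: the page is built on each request and never stored."""
    return render(village)


def materialise(villages, out_dir):
    """Compiled mode: every page is generated once and written to disk."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for village in villages:
        path = out / "{}.html".format(village)
        path.write_text(render(village), encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for page in materialise(["Chavenay", "Ecquevilly"], tmp):
            print(page.name)
```

Both modes call the same render function, which is why the dynamic mode can be
used while tuning a prescription and the compiled mode for the final medium.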

5 Comparison with other works 

The main feature that distinguishes Norfolk from other approaches is that it
is document-oriented. Virtual documents on the Web are usually generated
through programs or scripts that can sometimes be embedded in an HTML
document. This often means that the person responsible for the document (the
creator) has to rely on someone else (the programmer) to deliver the
information on the Web. Furthermore, these languages do not offer appropriate
data types for the manipulation of document structures, which is crucial as
virtual documents need to extract and include parts of other documents, or to
render other information in a document-like manner.

The CD-ROM in [10] about architectural heritage in La-Rochelle was produced
by defining XML documents (according to an Inventory DTD) which were
automatically translated into HTML using XSL. XML pages were translated
according to XSL templates similar to our prescriptions. The difference is
that our language can query several XML documents or other data sources,
while XSL templates are applied to one XML page at a time: it is a
one-page-to-one-page translation.

The Araneus system [1] uses a page model language to define Web pages and
typed links between pages. Content of pages can be extracted from various data
sources and integrated through a nested table model. Although the definition of
true typed links between classes of pages is interesting, we think that our
tree model and page prescription language fit better with the XML/HTML models
and languages.

6 Conclusion 

We have shown how XML metadata can support the retrieval of photographs
and the generation of advanced presentations using these photographs. The cost
of creating the metadata makes it important to make them reusable for a variety
of purposes and extensible to a particular application. Dublin Core can be used
for photographs after some adjustments, especially on the Format and Source
elements. Dublin Core does have some limitations, however, when it comes to
describing “surrogates”, or when several photographs represent the same object.
Our approach with metadata has the following advantages:

– Being textual and external to the image, our metadata descriptions are easier
to read and to modify.

– We can use an SGML query language to search or extract metadata (here
we have used the Norfolk query language).

– Our metadata scheme is extensible, as we have shown by extending the
Dublin Core with qualifiers and simple schemas.

The generation of pages using virtual documents in our application has the
following advantages:

– There is a considerable gain in time: instead of writing many similar pages,
only one prescription has to be written.

– The pages are less prone to errors. The photograph metadata and document
prescriptions are written and checked once, and reused many times.

– There is a separation between information and delivery (i.e. metadata and
HTML pages), which makes it easy to define several presentations of the same
information.

– Norfolk can support the flexible generation of presentations from
heterogeneous sources, which will be a requirement for the next generation of
Digital Libraries.

Future work involves the definition of models for comparison that support 
the generation of queries for appropriate comparable elements, and the display 
of these elements.

Appendix 1: Example of XML Metadata for the Photograph “AH3”

<?xml version="1.0"?>
<metadata lang="fr">
  <DC:Title> Maison rurale à Chavenay (78), France </DC:Title>
  <DC:Creator> Anne-Marie Vercoustre </DC:Creator>
  <DC:Contributor> Labo Service; Art de Vivre, 78630 Orgeval,
      France </DC:Contributor>
  <DC:Date> 1990 </DC:Date>
  <DC:Subject xml:lang="fr">
      maison, façade, volet, fenêtre, lucarne, toiture, ravalement
  </DC:Subject>
  <DC:Description xml:lang="fr">
      Charmante petite maison de village
  </DC:Description>
  <DC:Description SCHEME="habitat" xml:lang="fr">
    <rural>
      <ravalement> Enduit couvrant traditionnel qui vient mourir
          sur les tableaux de fenêtres. Le soubassement est
          traité différemment pour laisser respirer la pierre.
      </ravalement>
      <fenetre> Fenêtres à 6 carreaux</fenetre>
      <toiture> Jolie toiture; Lucarne à capuche
          qui a été raccourcie.
      </toiture>
      <volet> Volets pleins traditionnels à simple barre.
          La couleur bleue passée est pleine de charme.
      </volet>
      <etat> restauré, bon </etat>
    </rural>
  </DC:Description>
  <DC:Publisher> Association Sportive et Culturelle des Alluets-le-Roi
      (LASCAR) </DC:Publisher>
  <DC:Identifier>Photos/AH3</DC:Identifier>
  <DC:Type> image </DC:Type>
  <DC:Type> color photograph </DC:Type>
  <DC:Type.quality> good </DC:Type.quality>
  <DC:Format> image/jpeg </DC:Format>
  <DC:Format.full> 800X537 </DC:Format.full>
  <DC:Format.small> 200X134 </DC:Format.small>
  <DC:Format.orientation> vertical </DC:Format.orientation>
  <DC:source> AH.3 </DC:source>
  <DC:Source.type> color negative </DC:Source.type>
  <DC:Rights> (c) Anne-Marie Vercoustre </DC:Rights>
  <DC:Relation> Habitat rural et sa restauration, collection
      de 150 photos sur CD-ROM
  </DC:Relation>
  <DC:Coverage SCHEME="address">
    <address>
      <number>21</number>
      <street>Grand Rue</street>
      <town type="village">Chavenay</town>
      <country>France</country>
    </address>
  </DC:Coverage>
  <DC:Coverage> Maule et environs, Yvelines, France
  </DC:Coverage>
</metadata>



Appendix 2: RDF Metadata for the Photograph AH3

<?xml version="1.0" encoding="ISO-8859-1"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.0/"
         xmlns:dcq="http://purl.org/dc/qualifiers/1.0/">
  <rdf:Description rdf:about="Photos/AH3">
    <dc:creator> Anne-Marie Vercoustre </dc:creator>
    <dc:contributor> Labo Service; Art de Vivre, 78630 Orgeval, France
    </dc:contributor>
    <dc:date>
      <rdf:Description>
        <rdf:value> 1990 </rdf:value>
        <dcq:dateType> photo shot </dcq:dateType>
      </rdf:Description>
    </dc:date>
    <dc:subject xml:lang="fr">
        maison, façade, volet, fenêtre, lucarne, toiture, ravalement
    </dc:subject>
    <dc:description>
      <rdf:Alt>
        <rdf:li xml:lang="fr">
            Charmante petite maison de village </rdf:li>
        <rdf:li xml:lang="en"> Nice little village house</rdf:li>
      </rdf:Alt>
    </dc:description>
    <dc:description xml:lang="fr">
      <rdf:Description>
        <rural:ravalement> Enduit couvrant traditionnel qui vient
            mourir sur les tableaux de fenêtres. Le soubassement est
            traité différemment pour laisser respirer la pierre.
        </rural:ravalement>
        <rural:fenetre> Fenêtres à 6 carreaux</rural:fenetre>
        <rural:toiture> Jolie toiture;
            Lucarne à capuche qui a été raccourcie.
        </rural:toiture>
        <rural:volet> Volets pleins traditionnels à simple barre.
            La couleur bleue passée est pleine de charme.
        </rural:volet>
        <rural:etat> restauré, bon </rural:etat>
      </rdf:Description>
    </dc:description>
    <dc:publisher> Association Sportive et Culturelle des Alluets-le-Roi
        (LASCAR) </dc:publisher>
    <dc:identifier>AH3</dc:identifier>
    <dc:type>
      <rdf:Bag>
        <rdf:li> image </rdf:li>
        <rdf:li> <rdf:Description>
            <rdf:value> color photograph </rdf:value>
            <dcq:typeType> photograph </dcq:typeType>
        </rdf:Description> </rdf:li>
        <rdf:li> <rdf:Description>
            <rdf:value> good </rdf:value>
            <dcq:typeType> quality </dcq:typeType>
        </rdf:Description> </rdf:li>
      </rdf:Bag>
    </dc:type>
    <dc:format> image/jpeg </dc:format>
    <dc:format>
      <rdf:Bag>
        <rdf:li> <rdf:Description>
            <rdf:value> 800x537 </rdf:value>
            <dcq:formatType> fullsize </dcq:formatType>
        </rdf:Description> </rdf:li>
        <rdf:li> <rdf:Description>
            <rdf:value> 200x134 </rdf:value>
            <dcq:formatType> thumbnail </dcq:formatType>
        </rdf:Description> </rdf:li>
        <rdf:li> <rdf:Description>
            <rdf:value> vertical </rdf:value>
            <dcq:formatType> orientation </dcq:formatType>
        </rdf:Description> </rdf:li>
      </rdf:Bag>
    </dc:format>
    <dc:source> AH.3 </dc:source>
    <dc:source>
      <rdf:Description>
        <rdf:value> color negative </rdf:value>
        <dcq:sourceType> original </dcq:sourceType>
      </rdf:Description>
    </dc:source>
    <dc:rights> (c) Anne-Marie Vercoustre </dc:rights>
    <dc:relation> Habitat rural et sa restauration, collection de
        150 photos sur CD-ROM
    </dc:relation>
    <dc:coverage>
      <rdf:Description>
        <rdf:value>
          <adress:number>21</adress:number>
          <adress:street>Grand Rue</adress:street>
          <adress:town type="village">Chavenay</adress:town>
          <adress:country>France</adress:country>
        </rdf:value>
        <dcq:coverageType> location </dcq:coverageType>
      </rdf:Description>
    </dc:coverage>
    <dc:coverage> Maule et environs, Yvelines, France
    </dc:coverage>
  </rdf:Description>
</rdf:RDF>




Audiovisual Cultural Heritage:
From TV and Radio Archiving to Hypermedia Publishing

Gwendal Auffret and Bruno Bachimont

Institut National de l’Audiovisuel (INA)**, Direction de la Recherche Prospective
4 Avenue de l’Europe, 94366 Bry-Sur-Marne Cedex
{gauffret,bbachimont}@ina.fr



Abstract. In this article, we present a model of a digital audiovisual
(AV) library. We describe how AV library users need to be provided
not only with accurate and efficient ways to retrieve images and sounds,
but also with new environments allowing them to read and interpret these
images and sounds as AV documents. We show how library users perform
an active reading of documents by contextualizing them using corpora of
structured meta-information. This documentation consists of documents
elaborated from previous readings of this AV content, such as producers’
files, critics’ articles, etc. It provides a good alternate representation as
defined in [34]. We propose a model allowing library users to read AV
documents not only along their documentation but from their documentation.
This model is based on concepts from the electronic publishing world: it
defines different levels of editorial control over the semantics, the structure
and the layout of documentation and, in the end, allows the automatic
generation of hypermedia applications, which can be used as a new
and efficient AV reading environment by library users. We also describe
a prototype implementing parts of this model.



1 Introduction 

For a long time, books have been considered as the only “real” cultural
artefacts, whereas mass media in general, and TV and radio in particular, were
regarded as popular means of leisure, without any real cultural value. But
recently, audiovisual (AV) publications have been more and more recognized as
part of our cultural heritage. New patrimonial AV libraries are being built,
which appear to be quite different from traditional broadcast archives. Users
of such libraries are scholars, journalists, school teachers and pupils who
perform what we call an “active reading” of AV documents: they read and
interpret these publications in order to write and publish their own essays,
assignments or articles. This type of reading requires specific means of
access to AV documents and to their context of production, publication and
reception.

** INA, Institut National de l’Audiovisuel, is the French TV and Radio Archives.
It has been archiving French TV since 1949 and French Radio since 1929. Its
repository contains more than 3 million documents, amounting to 400 000 hours
of video and 500 000 hours of audio programs.

In this article, we propose a model for the creation of such digital AV libraries.
First, we provide a definition of AV documents, which distinguishes them from
AV streams and AV storage units. Secondly, we analyze the tasks performed by
library users on such documents and we show that a digital AV library cannot
be limited to a Video On Demand system or an image bank, as many projects
tend to do. In particular, we show that digital AV libraries can be considered
as large-scale publishers of structured documentation. As a result, we describe
how the whole metadata generation process can be organized in four major
steps: the inferential consistency step (semantic definition of descriptors),
the descriptive consistency step (definition of the description scheme and
content indexing), the editorial consistency step (definition of the
documentation structure) and the layout consistency step, which takes
advantage of document management technology such as style sheets in order to
publish hypermedia presentations automatically from the documentation
structure. Finally, we provide an example of this electronic publishing chain
and we show how it can be used in AV digital libraries to provide users with
new means of browsing TV and radio documents.

2 AV Document: A Definition 

We propose to extend the notion of document provided by R. Furuta in [18] with
the following definition:

Document: self-contained unit representing an identified intellectual
contribution and published on a medium for some specific purpose. A document
exhibits, to a certain extent, an intentional structure which defines how the
elements of its content are organized along its axes¹. This structure is
interpreted by a reader as a testimony of the original publishing purpose.

¹ Namely time and space for AV documents.

The above definition raises a question: how can the document unit be
identified? This apparently simple question is, in itself, a crucial problem
for TV and radio libraries. If we consider the case of textual libraries, it
appears that books, which are editorial units, often also delimit document
units: the textual storage unit (i.e. the physical copy of the book) matches
the extent of the intellectual contribution of the author, as well as the
extent of the object used by readers to appropriate and interpret the content
of the document. Unfortunately, as soon as we are dealing with AV content,
this matching disappears: a document can be stored on multiple tapes or film
reels and these are not immediately readable by human beings. They have to be
manipulated mechanically to reproduce images and/or sounds which are shown on
a screen and/or played using an amplifier. From the outset, the temporal
nature of AV content imposes specific constraints on reading by separating the
following three elements:

AV stream: an AV stream is any linear temporal succession of sounds and/or
images following a specific rhythm that makes it understandable for a reader.
This definition covers, for instance, media such as cinema, TV or radio.

AV storage unit: as AV streams are temporal and continuous, they cannot
be stored as such: they require spatial storage devices that allow the
re-creation of a certain piece of the AV stream by the mechanical manipulation
of spatially represented information. We call these storage devices “AV
storage units”. Examples of such storage units are film reels, Beta SP or VHS
video tapes, MIDI or MPEG-1 files, etc.

AV document: an AV document corresponds to a segment of an AV stream
testifying to a specific editorial practice, which is stored on one or more AV
storage units. For instance, the 8 O’Clock news program of France 2, stored
on a half Beta SP tape, can be considered as a document testifying to the
editorial practice of this specific broadcaster.
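
The three notions can be made concrete with a small data model. The Python
sketch below is only one possible reading of the definitions: the class and
field names (StorageUnit, AVDocument, stored_on, migrate) are assumptions
introduced for illustration, and only the temporal axis is represented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class StorageUnit:
    """A physical or digital carrier (hypothetical fields)."""
    label: str    # e.g. a tape or file identifier
    medium: str   # e.g. "Beta SP", "VHS"


@dataclass
class AVDocument:
    """A segment of an AV stream, stored on one or more storage units."""
    title: str
    stream: str   # name of the AV stream the document is a segment of
    start: float  # seconds from an arbitrary origin on the stream
    end: float
    stored_on: List[StorageUnit] = field(default_factory=list)

    def duration(self):
        return self.end - self.start

    def migrate(self, old, new):
        """Swap one storage unit for another; the document is unchanged."""
        self.stored_on = [new if unit == old else unit
                          for unit in self.stored_on]
```

Keeping the storage units as an attribute of the document, rather than the
reverse, reflects the point that a carrier can be replaced without the
document's identity or temporal extent changing.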



Fig. 1. AV streams, AV storage units and AV documents



These distinctions, which are illustrated in Figure 1, show that AV documents
cannot be restricted either to an AV stream (since streams are continuous
and we have defined documents as discrete units), or to AV storage units². In
the end, it is necessary to take into account the editorial practice in order
to abstract the document. Let us consider, for instance, the two repositories
represented at INA, the public service broadcasters’ archives and the legal
deposit library: when describing the same stream (namely French public TV and
radio), it appears that they do not consider the same document units and
structures. Indeed, the first archive is concerned with what has been
published by the producer and it stores tapes holding the original TV and
radio programs created for the broadcasters, whereas the other is concerned
with what has been published by the broadcaster [16] and therefore stores
recordings of the AV stream that has actually been broadcast, including the
advertisements and any unexpected interruptions of the stream.

² Even if this approach constitutes the underlying model of many research
projects today [22], [35], it cannot be accurate in the context of AV
libraries. Indeed, AV storage units are technical artefacts, which change over
time (a damaged film reel can be transcoded to a modern video tape) whereas
the documents themselves remain unchanged (the 6 O’Clock news program of 1st
of July 1957 remains the same document whether it is stored on Beta SP or on
VHS tapes). Moreover, quite often, the storage unit boundaries are different
from the document boundaries (a film is most of the time on multiple film
reels) and different storage units can be used to store the same document for
different purposes (the archive might store a high bitrate version for long
term conservation and a low bitrate version for netcasting). In the end,
building a digital AV library on a model that identifies AV documents with AV
storage units might seem an easy technical solution but raises many problems
in the long term.

This example shows that there is no such thing as an “objective” (or even
commonly accepted) AV document unit and that AV documents are, in fact,
constructed by the library from an interpretation of the original editorial
practice.

3 Requirements on a Digital AV Library

Once segmented and stored, documents must be described by librarians before
being provided to users. Indeed, there is no such thing as a “full AV stream
search” mechanism to directly retrieve and manipulate elements of the AV
content, since images and sounds are specific semiotic forms which, contrary
to text, do not provide direct access to any discrete and semantically
relevant unit. There is no equivalent to symbols, letters and words in AV
streams, nothing that would be regarded by a large majority of users as
semantic units and which could be used as a basis for search and computation
processes.

Many research projects try to address the complex issue of searching and
retrieving AV content. In particular, as the current flourishing literature in
the field of multimedia shows, providing access to AV content repositories
requires not only stable compression and streaming standards such as the ones
developed by MPEG [28], but also thorough research work in the fields of
descriptor extraction [2], database technology [19,29], server delivery [11],
information retrieval methods [13,35,17] and user interfaces [14]. These
useful and relevant technical answers most of the time share one common goal,
namely to provide new and efficient ways to search, select and retrieve moving
images and sounds in digital repositories.

Of course, this type of usage is crucial for many people and, in a sense, it
corresponds to some of the traditional goals of TV and radio archives, which
were originally intended as a resource repository for producers and
broadcasters. However, we claim that restricting the requirements on a digital
AV library model to this usage would merely amount to considering AV libraries
as huge Video On Demand (VOD) systems or image banks, which they are not.
Indeed, library users are not only looking for images and sounds: they are
involved in a specific reading task and they look for AV documents which they
use as primary source material for their own work. This type of AV reading is
not passive like the one anyone experiences in front of a TV set; it is an
“active reading”, i.e. a reading activity which leads to the writing of a new
document (be it audiovisual or not). Such an active reading implies the
thorough analysis, the contextualization, the interpretation and the rewriting
of the document through annotation [33], which, if traditional and relatively
easy for text (even if, in the digital domain, new tools are needed, as shown
in [12]), remains scarcely developed for images and sounds. There are few
tools targeting scholarly research on AV documents, such as, for instance, the
FRANCK system [34].

In order to formalize these user-driven constraints, we provide here a set of
requirements which apply to any digital AV library model. In our opinion, in
order to provide an efficient service, a digital AV library should allow users to:

1. search for AV documents: use efficient library tools (such as catalogs, thesauri
or ontologies) or information retrieval methods to look for documents in the
repository. This requires a coherent indexing of the content of AV documents;

2. browse AV documents: access the content of AV documents and perform a
non-linear reading (or viewing/listening) using traditional VCR functions
(play, pause, stop, back, forward, etc.) or any means of direct access allowed
by the digital nature of the document;

3. navigate in AV documents using metadata structures: access directly the
content of AV documents using efficient navigational tools such as temporal
and spatial summaries and abstracts, tables of contents, glossaries, etc.,
which guide the interpretation and help to grasp the overall content of an AV
document;

4. interpret AV documents in context: access the documentation of the AV
document. Indeed, an AV document, like any semiotic production, is never a
standalone object: it is always inserted into a communication process implying
a production and a reception context. Along this chain, many documents
are created as a result of previous readings of the AV document: from the
author’s project to the critics’ articles, the production file, the rights
management file, the script, the pictures taken during the shooting, the
report of the sound recording session, the original footage which has not been
edited, etc. All these documents constitute a contextual corpus which can be
used to guide the interpretation of the content of the document and to
contextualize it from diverse points of view;

5. annotate AV documents: write down their own interpretation of the AV
document using annotations and use these annotations as a means of browsing
and documenting the AV content.

4 Our Model: The Digital AV Library as a Hypermedia
Publisher

To fulfill the above requirements, we propose to apply the model illustrated in
Figure 2. This model is based on the hypothesis that the digital AV library
is, in fact, a large-scale publishing house generating structured documentation
and that this documentation can be used as an efficient means of hypermedia
browsing and contextual interpretation of the AV document content. Our model
is composed of a cascading set of constraints and validation steps which allow a
multi-level consistency check of the metadata published³.

[Figure labels: Inferential consistency, Descriptive consistency, Editorial
consistency, Layout consistency]

Fig. 2. Life cycle of the documentation

First, at the inferential consistency step, the library defines the semantics of
its descriptors as well as the semantic relations among these descriptors. This
step corresponds to the formalization of the domain knowledge. Secondly, at the
descriptive consistency step, the library defines rules concerning the application
of descriptors to spatial and temporal segments of AV documents and applies
these rules during the indexing process. Third, at the editorial consistency step,
the library defines how the indexes should be combined in order to create
navigation tools such as temporal, spatial or conceptual tables of contents of
the AV document, and how related documents produced alongside the AV document
production chain can be gathered and linked to the description as a piece of a
hypermedia logical structure. This step corresponds to what we call the
documentation of the AV document content. Finally, at the layout consistency
step, the library defines how the documentation and the AV document must appear
on screen as a hypermedia application. This step corresponds to the publishing
of the documentation as an AV reading device.

³ Please note that this figure should not be regarded as a chronological
representation of the library design process. Indeed, there are many feedback
cycles among these different logical steps. For instance, it is impossible to
define descriptors at the inferential consistency step without having first
studied the user requirements for browsing and searching AV documents at the
layout consistency step, etc.



4.1 Inferential Consistency Step: Defining the Semantics of 
Descriptors 

The first concern of any digital AV archive is to control the semantics of its
descriptors. Descriptors are symbols, linguistic terms or even icons [15],
representing a coherent set of semantically relevant elements of the content
of AV documents. The interpretation of their semantics must be controlled as
much as possible if we want metadata to be interpretable and computable in a
consistent and accurate way.

Exactly like the document unit, the semantics of descriptors is an institutional
choice: one specific library uses a term in its own specific sense, which might be
different from other libraries. This is particularly true of broadcast archives, since
the tradition of AV document description as well as the practice of bibliographic
exchange in that field is far from being as developed as in the field of text. AV
librarians and library users have not yet agreed on a common background such
as the tag sets provided by the Text Encoding Initiative [10], for example. Each
AV library has its own view of AV document content. Therefore, there is only
one way to ensure consistency among descriptions and consistent interpretation
by users: each AV library must provide clear (and as unambiguous as possible)
definitions of the specific semantics of its descriptors. We propose to use an
ontology for this purpose. An ontology is traditionally defined as the
“specification of a conceptualization” [20], and is used to represent the
concepts associated with a domain. In this article, we will follow the
definitions provided by [7], i.e. descriptors are linguistic terms. The
semantics of these descriptors is defined by the location of the terms in an
“is-a” tree. Once the terms used in the domain are defined by experts using
this tree structure, they can be used as primitives for any formal
representation of the domain knowledge.
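
A term's position in an “is-a” tree can be represented very simply. The Python
sketch below is an illustration only: the terms and their arrangement in IS_A
are invented for the example and are not taken from INA's ontology, and the
helper names ancestors and is_a are assumptions.

```python
# Hypothetical is-a tree: each term maps to its broader term
# (None marks the root). The vocabulary is invented for illustration.
IS_A = {
    "descriptor": None,
    "person": "descriptor",
    "anchor man": "person",
    "segment type": "descriptor",
    "scene": "segment type",
    "shot": "segment type",
}


def ancestors(term, tree=IS_A):
    """Return the chain of broader terms, nearest first."""
    chain = []
    parent = tree[term]
    while parent is not None:
        chain.append(parent)
        parent = tree[parent]
    return chain


def is_a(term, broader, tree=IS_A):
    """True if `term` equals `broader` or lies below it in the tree."""
    return term == broader or broader in ancestors(term, tree)
```

With such a structure, a query on a broad term can be expanded to every
narrower descriptor, which is one elementary form of the inference mentioned
in the text.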

We are currently working on the creation of an ontology which would allow
the explicit representation of indexing methodologies used by the community
of AV archivists. We plan to express such ontologies using knowledge
representation formalisms such as the Resource Description Framework (RDF)
(see http://www.w3.org/Metadata/RDF), which would then allow us to use
inference engines to apply intelligent querying methods to the AV archive
repository.

4.2 Descriptive Consistency Step: Expressing Description Schemes 
and Validating Descriptions 

Once descriptors are defined, they have to be associated with AV documents.
When AV documents were stored in analog format, descriptors were associated
with the whole document unit. In a totally digital world, however, it is possible to
perform a much more precise indexing and to associate descriptors with segments
of the AV document (a part of a document characterized by spatio-temporal
coordinates). This type of indexing is often referred to as content indexing. As a
first approach, we define an index as the association of a specific descriptor with
a specific segment, which is quite similar to the notion of Primitive Annotation
proposed by Prié et al. [30]. Indexes make relevant substitutes for actual segments
of AV documents: users can search, retrieve and manipulate segments by simply
manipulating indexes, which are symbolic, and thus easily computable.

In a library, the indexing is mostly created manually by documentalists who
interpret the content of documents. As we want the descriptions to be as
systematic as possible in order to allow efficient automatic processing, we
need to constrain the creation of indexes: we use description schemes, which
define limitations on the way a specific type of descriptor can be attached to
a specific type of segment, limitations on which types of descriptors must or
might be instantiated in order to validate the description, and limitations on
the spatio-temporal borders of the segments indexed and their spatio-temporal
relations. In this sense, a description scheme formalizes the indexing policy
of the digital AV library with regard to a certain type of AV document.

Defining Structural Constraints on Indexes. The description scheme contains
not only the list of the index classes to be used in a specific type of
description, but also constraints on the instantiation of these index classes.
These can be divided into two categories:

- Cardinality constraints: constraints on the instantiation and cardinality of
specific types of index. For instance, it can be specified that the description of
news programs must contain one or more segments indexed by the "Anchor
Man" descriptor.

- Axial constraints: AV segments are organized along axes which, for
audiovisual material, are temporal and spatial. Description schemes define
constraints on the instantiation of indexes along these axes, i.e.
spatio-temporal constraints. For instance, it can be specified that any segment
indexed by the "shot" descriptor should be temporally included in a larger
segment indexed by the "scene" descriptor, or that, in the context of a document
of type "news", the anchor man "follows" or "overlaps" the titles, etc.

The composition of cardinality and axial constraints defines a spatio-temporal
grammar of the AV description, which can be used for validation in the same way
as SGML/XML Document Type Definitions (DTDs) are used for textual documents.
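The two kinds of constraint can be illustrated with a minimal sketch. It is not the AEDI implementation: the `Index` class, the function names and the hypothetical "news" rules below are assumptions made for illustration, and segments are reduced to a single temporal axis measured in seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Index:
    """An index: one descriptor attached to one temporal segment (seconds)."""
    descriptor: str
    start: float
    end: float


def check_cardinality(indexes, descriptor, minimum=1):
    """Cardinality constraint: at least `minimum` indexes use `descriptor`."""
    count = sum(1 for i in indexes if i.descriptor == descriptor)
    return count >= minimum


def check_inclusion(indexes, inner, outer):
    """Axial constraint: every `inner` segment lies inside some `outer` segment."""
    outers = [i for i in indexes if i.descriptor == outer]
    return all(
        any(o.start <= i.start and i.end <= o.end for o in outers)
        for i in indexes
        if i.descriptor == inner
    )


def validate(indexes):
    """Return the violated rules of a hypothetical 'news' description scheme."""
    errors = []
    if not check_cardinality(indexes, "Anchor Man", minimum=1):
        errors.append("missing 'Anchor Man' segment")
    if not check_inclusion(indexes, "shot", "scene"):
        errors.append("a 'shot' is not included in any 'scene'")
    return errors


description = [
    Index("scene", 0.0, 60.0),
    Index("shot", 0.0, 12.5),
    Index("shot", 12.5, 60.0),
    Index("Anchor Man", 0.0, 12.5),
]
print(validate(description))  # prints []
```

A description made of a single shot with no enclosing scene and no anchor man would fail both checks, which is the behaviour expected of a validating grammar.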

Formally Representing Description Schemes and Descriptions Using XML. In [4],
we proposed to encode description schemes using a model developed at INA, the
Audiovisual Event Description Interface (AEDI).

Using DTDs to encode description schemes. In the context of the ACTS project
DICEMAN⁴, we encoded AEDI in an XML-based syntax. We extended the XML
DTD mechanism to express index types and constraints. This XML expression of
our model was simple to implement and has been a satisfying proof of concept,
which has been presented to the MPEG-7 standardization body [6,1].

⁴ http://www.teltec.dcu.ie/diceman/

However, using DTDs to represent description schemes for AV documents
proved restrictive. Indeed, DTDs were created to encode a limited set of
well-identified grammatical constraints applying to a single axis: the linear
axis of the characters composing a text. Transposing these constraints into a
spatio-temporal grammar proved complicated. It forced us to fix the semantics
of DTDs in a new domain which was not bijective with the original one. For
instance, validation on the temporal axis requires the expression of specific
constraints such as Allen relations [3]. On the other hand, the XML DTD did not
provide any suitable connector. As a consequence, we had to use the ","
connector to separate element types in the content model. But what does this
connector mean in a spatio-temporal context? For temporal segments, we can
easily imagine that it is a succession constraint, but what does succession mean
when it comes to spatial objects? In the end, some DTD rules could not be
translated into a spatio-temporal environment and, as a consequence, remained
over-constraining, whereas others could not be expressed and controlled by
DTDs alone and required a second level of parsing.

Using an XML schema to encode description schemes and descriptions. In the end,
we decided not to use DTDs and to create our own format for the expression
of description schemes. This format is based on an XML serialization of the
AEDI model. XML is now just the syntax we use to declare the constraints on
descriptions as well as the descriptions themselves.

In description schemes, users define the axes of the document in a coordinate
system quite similar to HyTime FCS [25]. Then, they specify the classes of
elements which are usable in descriptions, such as:

- descriptor classes: descriptors are description elements with a name and
attributes;

- axial descriptor classes: descriptor classes with a content model on the axes
of the descriptions. The instances of axial descriptors will be attached to
segments of the document and constitute the core of the description tree
structure. It is possible to specify whether an axial descriptor has implied or
explicit bounds on its axis, whether some of the bounds are inherited from its
parent or must be computed from its children’s bounds, whether it is possible to
define an order relation among its children on a given axis, etc.;

- value containers: attribute-value pairs, where the value can be a standalone
object (e.g. title:string), or a list or structure of objects (e.g.
Filmography:film+).

Moreover, description scheme designers can express constraints on the
instantiation of axial descriptors. For instance, it is possible to specify that
the children of a specific axial descriptor class should not overlap on a given
axis, or that axial descriptors of class A and axial descriptors of class B
should stand in a given Allen relation (e.g. they should always "meet") on a
certain axis, etc.
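As an illustration of such constraints, the sketch below computes the Allen relation between two intervals on one axis and uses it to check that sibling segments do not overlap. The function names and the tuple representation of intervals are assumptions made for this example; they are not part of the AEDI format.

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a with respect to interval b.

    Intervals are (start, end) pairs on one axis, with start < end.
    """
    a0, a1 = a
    b0, b1 = b
    if a1 < b0:
        return "before"
    if a1 == b0:
        return "meets"
    if b1 < a0:
        return "after"
    if b1 == a0:
        return "met-by"
    if a0 == b0 and a1 == b1:
        return "equals"
    if a0 == b0:
        return "starts" if a1 < b1 else "started-by"
    if a1 == b1:
        return "finishes" if a0 > b0 else "finished-by"
    if b0 < a0 and a1 < b1:
        return "during"
    if a0 < b0 and b1 < a1:
        return "contains"
    # Only proper overlaps remain at this point.
    return "overlaps" if a0 < b0 else "overlapped-by"


def children_do_not_overlap(children):
    """Constraint: sibling segments may only precede or meet each other."""
    ordered = sorted(children)
    return all(
        allen_relation(x, y) in ("before", "meets")
        for x, y in zip(ordered, ordered[1:])
    )


print(allen_relation((0, 10), (10, 20)))            # prints meets
print(children_do_not_overlap([(10, 20), (0, 10)]))  # prints True
print(children_do_not_overlap([(0, 10), (5, 20)]))   # prints False
```

Because the children are sorted by start time first, checking consecutive pairs is enough to guarantee that no two siblings overlap.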







Our model for description schemes allows the easy specification of simple
structures such as traditional shot, scene and sequence trees (in this case, it is
quite similar to an SGML DTD) as well as the specification of very complex
n-dimensional structures for other uses.

Once the description scheme is defined, descriptions conforming to this
description scheme can be expressed using our XML schema as a list of empty XML
elements related to one another by links. As a result, elements can be written
in the XML file in any order, which was not the case when we used DTDs, since
we had to identify the linear succession of characters with one axis of the
document, namely time, to be able to validate our constraints. In our new format,
we are much more independent of the textual constraints of XML.

Once generated, descriptions can be validated against their description scheme
using a validating Java parser. This parser has already been implemented and
should be used for the second phase of the ACTS-DICEMAN project.
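The idea of a description written as empty, linked elements in arbitrary order can be sketched as follows. The element name `segment` and the attributes `id`, `class`, `parent`, `begin` and `end` are hypothetical and chosen only for this example; the actual AEDI serialization is not reproduced here, and the validating parser mentioned above is written in Java, not Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical serialization: element and attribute names are illustrative only.
# The parent scene is deliberately written after its children.
DESCRIPTION = """
<description>
  <segment id="s2" class="shot" parent="s1" begin="0" end="12"/>
  <segment id="s3" class="shot" parent="s1" begin="12" end="60"/>
  <segment id="s1" class="scene" begin="0" end="60"/>
</description>
"""


def load_segments(xml_text):
    """Index every empty <segment/> element by its id, whatever its position."""
    root = ET.fromstring(xml_text)
    return {el.get("id"): el for el in root.iter("segment")}


def validate_links(segments):
    """Check that each parent link resolves and that bounds are nested."""
    errors = []
    for seg_id, el in segments.items():
        parent_id = el.get("parent")
        if parent_id is None:
            continue  # top-level segment, nothing to check
        parent = segments.get(parent_id)
        if parent is None:
            errors.append(f"{seg_id}: unknown parent {parent_id}")
            continue
        inside = (
            float(parent.get("begin")) <= float(el.get("begin"))
            and float(el.get("end")) <= float(parent.get("end"))
        )
        if not inside:
            errors.append(f"{seg_id}: not within bounds of {parent_id}")
    return errors


print(validate_links(load_segments(DESCRIPTION)))  # prints []
```

Since the structure is carried by the links rather than by element nesting or order, validation works on the resolved graph and not on the textual sequence of the file.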



4.3 Editorial Consistency Step 

Most of today’s projects in the field of digital AV libraries stop their design at
the previous step. Once indexes are anchored and metadata structures created
using knowledge representation technology, they can be stored in a database and
queried by information retrieval and automatic reasoning engines. The result of
such a query is a piece of AV stream which can be watched by users. Sometimes,
some elements of the meta-information related to the document (such as the title
and the author) are provided to users in order to help them contextualize the
images and sounds they are looking at or listening to. However, we claim that
AV libraries are more than such image or sound banks inasmuch as they not
only create but also gather and organize metadata in order to help the users’
reading and interpretation tasks. We call this phase, during which the library
creates a logical structure contextualizing the AV document, the documentation
phase.



Gathering Related Documents From the Production to the Reception. First, an AV
document, like any document or any semiotic production, is never a standalone
object: it is always inserted into a communication process implying a
production context and a reception context [34]. Along this chain, many
documents are created about the AV document, from the author’s project to the
critics’ articles, including the production file, the rights management file,
the script, the pictures taken during the shooting, the report of the sound
recording session, etc. All these documents, which are collected by the AV
library, in fact constitute a contextual corpus describing the content of the
document from diverse points of view. From an interpretation point of view,
providing access to such documents is crucial since, thanks to this corpus,
twenty years after its creation, scholars can analyze the content of a TV or
radio document accurately by referring to its actual context of production and
reception [5].

Organizing Indexes into Navigation Tools. Indexes created by the library
cannot be integrated as such in this contextual corpus. Indeed, as stated above,
indexes are just independent pieces of information attached to the content of AV
documents: they are used to find content, not to read it. When used as
reading devices, indexes have to be organized into structures. Our hypothesis is
that, as with tables of contents, glossaries and indexes for texts, it is possible to
create specific index structures for AV documents which can be used as efficient
navigation tools.

These navigation tools remain mostly to be invented and will certainly evolve
with the stabilization of new AV reading usages in digital AV libraries. However,
we can point out some which have already been tested and have proved
useful, such as:

- navigation along one or more coordinate axes: indexes are related to
segments of the document content. These segments are themselves locators
related to one or more axes following the dimensions of the document (namely
mathematical time and space for AV documents). As a consequence, it is
useful to provide views on indexes that are organized along these axes.
The traditional example is the temporal view, where the temporal or
spatio-temporal indexes are grouped and ordered by begin time in the
AV document.

- navigation by class of descriptor: indexes relate segments to descriptors. We
can therefore create a view of the AV document content that is organized by
descriptor class. It is possible to group in the same structure element all the
segments indexed by descriptor instances of the same class (e.g. collect all
the segments indexed by the concept "Anchor Man"). This defines a sort of
glossary of the AV document. Moreover, the organization of the descriptors
in the ontology can be reused to organize such a glossary as a thesaurus.

- navigation by projection upon a specific set of indexes: it is possible to decide
that a specific set of indexes is the best segmentation or the best navigation
clue in the document and, therefore, to create a structure that "flattens"
the stratification by projecting the different indexes from the different
strata onto one specific set of segments corresponding to one or more chosen
descriptors. With this type of approach, it is possible, for instance, to build
a shot-based annotation of the AV document by projecting all the other
descriptors onto the segments indexed as shots.

- navigation following the structure of another document: many documents
from the contextual corpus have a structured content which, once related
to indexes, offers an efficient means of navigation through the AV document.
This is the case for transcripts or commentaries, for instance. It is therefore
possible to create specific index structures that are projected upon the
structure of a specific related document from the documentation.

- creation of video summaries using templates: the idea here is to gather
relevant segments of a corpus and to create a summary based on the elements
which are supposed to be the most relevant for a specific need. This approach
can be based on assumptions about the montage strategy, which are transposed
into automatic tools for the indexing of images as in [26], on a speech
recognition process and a recomposition from a keyword search [21], or be
totally controlled by a fixed template created by the library as in [27].

We are currently developing an integrated system that would allow us to
experiment with these tools and assess their efficiency in different reading
tasks, as well as to help specific structures emerge, especially for
pedagogical uses of AV documents.
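Three of the navigation tools listed above (the temporal view, the glossary by descriptor class and the projection onto shots) can be sketched with a few lines of code. This is only an illustrative sketch: the tuple layout, the sample indexes and the function names are assumptions, and real indexes would also carry spatial coordinates.

```python
from collections import defaultdict

# Each index is (descriptor_class, label, begin, end), with times in seconds.
# The sample values are invented for the example.
INDEXES = [
    ("shot", "shot 2", 12.0, 30.0),
    ("shot", "shot 1", 0.0, 12.0),
    ("Anchor Man", "studio opening", 0.0, 10.0),
    ("keyword", "election", 5.0, 20.0),
]


def temporal_view(indexes):
    """Navigation along the time axis: indexes ordered by begin time."""
    return sorted(indexes, key=lambda ix: ix[2])


def glossary_view(indexes):
    """Navigation by descriptor class: segments grouped under each class."""
    groups = defaultdict(list)
    for cls, label, begin, end in indexes:
        groups[cls].append((begin, end, label))
    return {cls: sorted(segments) for cls, segments in groups.items()}


def project_on(indexes, target_class="shot"):
    """Projection: attach every other index to the target segments it overlaps."""
    targets = sorted(
        (ix for ix in indexes if ix[0] == target_class), key=lambda ix: ix[2]
    )
    others = [ix for ix in indexes if ix[0] != target_class]
    view = {}
    for _, t_label, t_begin, t_end in targets:
        view[t_label] = [
            label
            for _, label, begin, end in others
            if begin < t_end and t_begin < end  # the two segments overlap
        ]
    return view


print([ix[1] for ix in temporal_view(INDEXES)])
print(glossary_view(INDEXES)["shot"])
print(project_on(INDEXES))
```

The projection keeps the shots as the only segmentation and lists, for each shot, the labels of the indexes from the other strata that overlap it, which corresponds to the shot-based annotation described in the list.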



Documentation or Hypermedia Logical Structure? As more and more
AV content is produced directly in digital form and as multimedia metadata
standards such as ISO MPEG-7⁵, the joint EBU-SMPTE task force⁶, W3C
Metadata⁷, CEN/ISSS⁸ or DAVIC⁹ are emerging, the creation of such
metadata structures will become necessary in order to help users find their way
through all the available AV data and the different indexes generated
automatically or annotated by hand.

Moreover, the different pieces of the contextual corpus might soon be
available in digital form and transmitted directly with the AV stream in a
standard way, along the production and distribution chain. Once received by the
library, they will have to be transformed and adapted to fit into the
documentation structure of the institution. As a consequence, in the perspective
of a totally digital archive, metadata and data can be linked and stored side by
side on the same digital medium in order to be manipulated by computer programs
in the same way. This means that the traditional separation between metadata and
data is disappearing and that, in digital AV archives, metadata becomes data.

This might sound trivial if we consider that, in textual libraries, metadata
and data have always been expressed using the same semiotic form, text [8],
but for AV archives and libraries, this digital convergence is a major move.
Indeed, AV documentalists have to consider as one single object things that have
traditionally been regarded as radically different. They were previously storing
pieces of AV streams in AV storage units and documenting them with text:
the AV document was a heterogeneous object which could not be manipulated
as a single entity. Its unity remained virtual and, so to speak, unreachable. Now
that technology allows all the elements of the AV document to be stored as a
single structure on a single digital medium, AV librarians discover that they
are storing and providing access to composite interrelated networks of images,
sounds and texts, namely hypermedia logical structures.

⁵ MPEG-7 ("Multimedia Content Description"). See http://drogo.cselt.it/mpeg/

⁶ The European Broadcasting Union / Society of Motion Picture and Television
Engineers task force has created a metadata dictionary which has been published
as an international standard in 1998 and should be adopted by the EBU and NATO
from 1999 on. See http://www.ebu.ch/

⁷ World Wide Web Consortium, see http://www.w3c.org/Metadata/

⁸ CEN Information Society Standardization System is currently involved
in the Metadata for Multimedia Information initiative (MMI). See
http://www2.echo.lu/oii/en/metadata.html

⁹ Digital Audio Visual Council, see http://www.davic.org/







We are using SGML [23] to structure and validate the documentation. More- 
over, we integrated some of the concepts and mechanisms from HyTime [25], 
such as coordinate systems, in our architecture in order to be able to represent 
the spatio-temporal anchors of the descriptors. In the end, we obtain one single 
hypermedia logical structure which can be processed for publishing purposes as 
shown in the next section. 

4.4 Layout Consistency Step 

Once all the documentation has been gathered and organized as an SGML-encoded
hypermedia structure, what can we do to provide our library users with
efficient access to this information?

We propose to generate a hypermedia presentation from this logical structure,
which would provide interactive and dynamic means of reading AV documents
from their documentation. However, we do not think it is reasonable to imagine
that a large-scale hypermedia application such as the document reading interface
of an AV digital library can be generated using traditional hypermedia
technology, which tends to focus on "one shot" productions such as traditional
cultural heritage CD-ROM productions [32]. Indeed, these productions are
expensive to build. They are often closed and it is very difficult to insert new
elements without having to change the whole structure. In a context where new AV
documents are being described and documented every day, we have to find another
approach.

Our approach is inspired by editorial techniques applied in the electronic
publishing industry. Indeed, as described above, the digital AV library applies
many constraints and controls to ensure that its metadata is consistent and
coherent. As a result, we are provided with highly structured and organized
material that can be processed in a systematic way and, in particular,
transformed automatically into hypermedia presentations following sets of rules.
This is the point of view of Rutledge et al. [31], who show how SMIL¹⁰
hypermedia presentations can be automatically generated from HyTime structures
using DSSSL style-sheets [24]. We are currently processing our SGML structures
using tree-transformation scripts and commercial software. In the future,
however, we would like to use standards such as DSSSL to build a complete
standard electronic publishing chain which would be independent of the software
market.

5 Implementation Issues 

As a prototype implementation of our model, we have explored the automatic
generation of JavaScript applications using an SGML transformation tool. An
example of such a presentation running in a client-server environment is provided
in Figure 3. This interactive version of a documentary program produced by
INA and France 3 has been created in collaboration with the director of the
film. We defined an ontology of descriptors and a description strategy which
was used during the production process. Moreover, we gathered documentation
created during the shooting and the editing and we encoded it in SGML along
the production chain. In the end, we obtained a hypermedia presentation which, in
a limited version (copyright problems for that kind of production remain huge!),
was webcast on http://www.ina.fr/Production/Studio/caillois.en.html
exactly at the time the TV program was broadcast on the French public
channel France 3.

Fig. 3. An automatically generated hypermedia presentation

¹⁰ SMIL (Synchronized Multimedia Integration Language) is a W3C recommendation
for the specification and the transmission of hypermedia presentations on the web.
See http://www.w3.org/TR/REC-smil/

Users accessing this application can, of course, simply watch the AV stream
using traditional VCR functions, but also read the transcript, which is aligned
to the timeline. This transcript is linked, sentence by sentence, to the AV stream
and can be used as a basis for full-text searches or hyperlinks, as is done in
[21].

Moreover, users can access a thesaurus of keywords provided by the archive
and by the director of the documentary and look for segments of the AV stream
indexed by these keywords. They can also look for the interventions of specific
speakers and combine keyword searches to select more accurate segments.

The original footage of the interviews has also been made available online.
When users switch from the final AV document to the footage, the elements which
were selected for the final editing appear in red in the text. This type
of interface provides access to the origins of the document, which is extremely
relevant for scholars working on TV production methods and strategies, for
instance.

Finally, it is possible for users to bookmark temporal references by adding
an annotation on the timeline of the document. These annotations, as well as all
the indexes provided by the archive, can also be used to create a new
user-centered editing: the user selects segments and decides to look at the
document only through these particular segments. In such a case, the application
creates a SMIL document on the fly and provides it to the server, which plays
only the selected segments.
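The on-the-fly generation of such a SMIL document can be sketched as follows. This is not the code of the prototype, which generates JavaScript applications from SGML: the function name, the video URL and the segment values are hypothetical. The `seq` and `video` elements and the `clip-begin`/`clip-end` attributes follow our reading of SMIL 1.0 and should be checked against the recommendation before use.

```python
import xml.etree.ElementTree as ET


def build_smil(video_url, segments):
    """Build a SMIL document that plays the selected (begin, end) segments in order.

    `segments` is a list of (begin, end) pairs in seconds on the timeline of
    the source video.
    """
    smil = ET.Element("smil")
    body = ET.SubElement(smil, "body")
    seq = ET.SubElement(body, "seq")  # children of <seq> play one after another
    for begin, end in segments:
        ET.SubElement(
            seq,
            "video",
            {
                "src": video_url,
                # Assumed SMIL 1.0 clip syntax using normal play time (npt).
                "clip-begin": f"npt={begin}s",
                "clip-end": f"npt={end}s",
            },
        )
    return ET.tostring(smil, encoding="unicode")


# Hypothetical stream URL and user selection.
selection = [(30, 45), (120, 150)]
print(build_smil("rtsp://example.org/documentary.rm", selection))
```

Each selected segment becomes one clip of the same source stream, so the user's personal editing is expressed without copying or re-encoding any video.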

6 Future Work 

A lot of research and implementation work remains to be done before our digital 
AV library model is complete. In particular, we plan to explore the following 
issues: 

- Assess the scalability of the model: so far our approach has been tested
on a prototype scale, and broadcasters were interested in the result and
the concept. We still have to assess it on a larger scale, first by developing
it from beginning to end (a lot remains to be done in the field of
ontology creation, for instance) and secondly by testing it thoroughly with
the production department on large-scale projects;

- Links between knowledge representation and document structure: in our model,
the inferential consistency step is obviously based on assumptions and models
from the knowledge engineering world, whereas the other steps are directly
inspired by the electronic publishing community. With the development of
the web, these two paradigms are becoming closer and closer: people need to
manipulate the semantics of their documents more easily, and document
management systems are becoming a major target for artificial intelligence
technology. However, the combination of semantics and grammar is not as easy as
it may seem and there is still a long way to go towards the "Semantic Web"
announced by Tim Berners-Lee [9]. As a consequence, we are currently working
on a framework to express and manipulate the constraints of the electronic
publishing world in the formalisms and representation paradigm of the
knowledge engineering world;

- Multiple points of view presentations: a hypermedia presentation of an AV
document such as the one illustrated in Figure 3 can be described as
"video-centric". Most of the presentation is driven by the AV stream, the other
types of available data being considered as satellites. In the future, we would
like to broaden this approach by allowing the exploration of the same
documentation structure through diverse entry points. In particular, each
element linked in the original base can be considered as the focus, or the
centre, of the browsing. However, browsing a video from a text and browsing a
text from a video are not quite the same thing. Therefore, we are working on
dynamic style-sheets, which could compute, from the characteristics of the
element in focus, the on-screen appearance of description elements and of links;

- Multiple delivery: we are currently generating hypermedia presentations
automatically but, as we are working with an SGML-encoded corpus, nothing
prevents us from imagining other style sheets that would allow the automatic
generation of a paper book or an audio tape at the same time, which could be
considered as multiple deliveries of the same AV document structure. We
are analyzing this opportunity with INA’s production department;

- User annotations: even if we integrated some level of annotation
functionality in our prototype, we would like to create more generic tools
allowing users to personalize their browsing of the AV document by adding their
own annotations (i.e. their own indexes) on top of the editorial structure
provided by the library. In such a case, they would in fact be able to perform
their own editing of the document and read it from this new point of
view.

We are currently working on a project for the use of documented filmed
theater for literature courses in high schools and we think this experience will
allow us to experiment with some of these points and to assess our main
hypothesis.

7 Conclusion 

In this article, we introduced our model of digital AV libraries. We distinguished
four major steps in the design of AV libraries: the inferential consistency step,
the descriptive consistency step, the editorial consistency step and the layout
consistency step. At each of these design steps, we showed that the library applies
specific control mechanisms to the semantics of its descriptors, the validity of its
indexing, the documentation structure and, finally, the way this structure can
be processed to automatically generate hypermedia presentations.
We provided an example of implementation and we described how the publishing
of such user interfaces by digital AV libraries would allow non-linear reading of
AV documents from their documentation.

In our opinion, the underlying concepts of this model, though developed
specifically for the purpose of patrimonial digital AV libraries, can be applied to
much larger contexts. Indeed, as long as users of a new repository need precise
contextual information for the interpretation of the documents (as in archives
dedicated to industrial projects, for instance), the type of reading task we are
targeting will be necessary and, therefore, the same requirements will apply.
Users of such systems cannot be provided only with a video or an audio player
and a search engine: they need more structured and interactive presentations,
which can be created at low cost only by applying thorough control to the
indexing and documentation process in order to allow efficient hypermedia
publishing.

Abstract. This paper describes an application which enables the
computer-assisted generation of Dublin Core-based metadata descriptions and
online digital visual summaries for videos. It is a Java application which
integrates a video replay window with VCR-type controls and metadata input forms
generated from a hierarchical RDF schema. The schema definition is also used to
validate the descriptions input by the user and control the format of the output.
The generated metadata descriptions can be saved as RDF, HTML or to a database.
They can be used to enable metadata interchange, searching across the Internet
or dynamic generation of detailed visual summaries for video browsing. This
prototype system has been developed for the State Library of Queensland’s
(SLQ) Audiovisual unit to enable quick, easy, cost-effective generation of
standardized metadata which can be used to create online detailed visual
summaries of the latest video acquisitions.



1 Introduction 

Technological advances are changing the roles of libraries. They must learn to provide 
enhanced access through improved reference services. Librarians today need to ana- 
lyze, interpret and evaluate the vast array of information sources available. Develop- 
ments in information technology are providing librarians with the opportunities to add 
value through continuous effective evaluation of information sources. Whitlach [1] 
suggests that libraries are migrating through three generations: 

1. The first generation introduced automated online catalogs and CD-ROM indexes;

2. The second generation added enriched records which include tables of contents, 
summaries for books, abstracts for periodicals; 

3. The third generation will incorporate evaluative elements into records e.g. notes on 
the type of audience, author’s qualifications, purpose and scope of the work, links to 
reviews and other related resources. 

S. Abiteboul, A.-M. Vercoustre (Eds.): ECDL '99, LNCS 1696, pp. 76-91, 1999
© Springer-Verlag Berlin Heidelberg 1999

Audiovisual collections traditionally lag behind textual collections in terms of basic
services such as online access, search and retrieval. A book or periodical can be
browsed by glancing through its table of contents. Audiovisual resources provide no
such simple and effective summary information. In addition, the complex,
hierarchical, information-rich nature of audiovisual collections makes the
value-added services which librarians are expected to provide, such as analysis,
interpretation and evaluation, much more difficult, subjective and time
consuming.

Hence the goal of this project is to provide computer-based tools which will enable 
first-generation audiovisual libraries to catch up to third-generation text-based libraries 
by providing enhanced services through the generation of online summaries for quick 
browsing and links to reviews for evaluation and interpretation. We achieve this 
through the application of standardized Dublin Core metadata contained within an 
RDF schema. This approach enables existing Dublin Core-based search engines to be 
extended to search across different media types and also within specific video seg- 
ments. 



2 Objectives 

This prototype was designed, implemented and tested using some of the latest video
documentaries acquired by the State Library of Qld’s audiovisual collection [2]. The
objectives of this research project were:

• to provide a tool which enables standardized metadata and video summaries to be
generated through a simple, cost-effective, computer-assisted process;

• to increase the usage of the audiovisual collection by providing Internet access to 
visually impressive detailed summaries using a combination of text, still images 
and video clips; 

• to reduce the time required by film and media researchers to locate particular, rele- 
vant video content; 

• to automate links to related interpretive and evaluative resources such as reviews, 
articles, papers or other works by the same creators; 

• to investigate qualifiers for the basic Dublin Core element set [3] that extend its 
descriptive semantics to the specific characteristics of video objects and enable 
their resource discovery; 

• to investigate the utility of the Resource Description Framework (RDF) [4] for 
expressing this qualification framework, by developing an RDF schema and vali- 
dating descriptions against this schema; 

• to extend an existing Dublin Core-based search engine (DSTC’s HotMeta) [5] so
that it can handle video metadata and investigate its ability to search and retrieve
different media types.

3 Video Indexing

3.1 Traditional Video Cataloguing 

Currently only bibliographic-type indexing, based on USMARC tags, is performed on
new video acquisitions at the State Library of Qld. Table 1 below shows the catalogue
record, accessible via the library’s OPAC, for one of the recently acquired
documentaries from the NetPAC series. We use this particular video throughout the
remainder of the paper as an example for illustrative purposes.



Table 1. MARC Record for Example Video

MARC Tag  Description                   Value
082       Dewey Decimal Classification  305.89921
245       Title                         The Sex Warriors and the Samurai
                                        (videorecording)
260       Publication Details           London: Formation Films for Channel
                                        Four, 1995
300       Physical Description          1 videocassette (27 min.) : sd., col. ;
                                        1/2 in
514       Language                      In Tagalog with English subtitles
522       Credits                       Producer, Parminder Vir; Written and
                                        directed by Nick Deocampo.
545       Summary                       Documentary about Jo-an who works
                                        in bars in Manila performing a drag
                                        act, and as a prostitute to support
                                        himself and his impoverished family.
                                        He works to get a work visa to enable
                                        him to go to Japan where he can earn
                                        money.
650       LCSH                          Female impersonators - Philippines -
                                        Manila
651       Geographic SH                 Manila (Philippines) - Social Conditions



3.2 Digital Video Indexing 

Typically the indexing of a digital video program consists of the following steps: 

1. Segment the video hierarchically into sequences, scenes, and shots. (A shot is a 
continuous sequence of frames captured from one camera. A scene is composed of 
one or more shots which present different views of the same event, related in time 
or space. A sequence is composed of one or more related scenes.) 

2. Describe the complete video using bibliographic information (title, creator, dates, 
subjects, item numbers, publisher details, names, synopsis etc.) plus format, frame 
rate, duration etc. 

3. Describe each sequence - id, start time/frame, end time/frame, brief textual 
summary. 

4. Describe each scene - id, start time/frame, end time/frame, brief textual summary, 
transcript (ideally derived from a closed-caption decoder). 

5. Describe each shot - id, start time/frame, end time/frame, keyframe (first frame of 
the shot, ideally derived from an automatic shot detection algorithm). 
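
The hierarchy produced by these steps can be sketched as nested records. The following is a minimal illustration in Python, not part of the system described in this paper; the class names, field names and frame numbers are assumptions chosen to mirror the descriptions in steps 3 to 5.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Shot:
    """A continuous run of frames captured from one camera."""
    shot_id: str
    start_frame: int
    end_frame: int
    keyframe: str = ""  # e.g. file name of the first frame of the shot


@dataclass
class Scene:
    """One or more shots presenting the same event."""
    scene_id: str
    summary: str = ""
    transcript: str = ""
    shots: List[Shot] = field(default_factory=list)

    def span(self):
        """Return (start_frame, end_frame) covered by this scene's shots."""
        return (min(s.start_frame for s in self.shots),
                max(s.end_frame for s in self.shots))


@dataclass
class Sequence:
    """One or more related scenes."""
    sequence_id: str
    summary: str = ""
    scenes: List[Scene] = field(default_factory=list)

    def span(self):
        """Return (start_frame, end_frame) covered by all scenes."""
        starts, ends = zip(*(scene.span() for scene in self.scenes))
        return (min(starts), max(ends))


# Hypothetical frame numbers, for illustration only.
scene = Scene("scene1", shots=[Shot("shot1", 0, 1200), Shot("shot2", 1201, 3650)])
sequence = Sequence("sequence1", scenes=[scene])
```

Deriving the start and end of a scene or sequence from its children keeps the three levels consistent when a shot boundary is adjusted.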

If closed captions are available, a closed-caption decoder may be used to extract 
the transcript. However, the majority of the videos acquired by the audiovisual unit 
do not contain closed captions, and when they do, the method by which the closed 
captions are encoded varies from country to country, often making their extraction 
highly problematic. Even when closed captions have been extracted, scene summaries 
must still be generated from the transcript, which is a time-consuming step. 

Software is available [6,7] which is capable of detecting scene changes (the first 
frame from each new scene) and saving them as image files e.g. GIF or JPEG images. 
However there are a number of problems with this automated approach. Scene change 
detection methods are improving but they can experience difficulties when there are 
fades, pans between scenes or fast motion within scenes. But more importantly, scene 
changes are often not the most visually impressive, aesthetic, evocative or even repre- 
sentative images for a particular scene. 

For our particular application, we propose that the best method is a computer- 
assisted, interactive, human-controlled approach, at least until the following technolo- 
gies become available: 

• Image processors which are capable of measuring aesthetics or visual impact; 

• An international closed caption encoding standard or decoders capable of handling 
the different video formats (PAL, NTSC) and closed-caption encoding methods; 

• Software which can automatically generate summaries from transcripts. 



3.3 Video Metadata 

An earlier paper [8] compares alternative approaches to video metadata and 
concludes that the ideal approach is to combine Dublin Core [3] and MPEG-7 [9] 
metadata descriptions within an aggregated RDF description [4, 10]. Dublin Core can 
be used for high-level generic searching across both text and multimedia objects, 
whilst the MPEG-7 (Multimedia Content Description Interface) standard will eventually 
enable low-level, fine-grained, media-specific search and retrieval. Since MPEG-7 
is still at such an early stage of development, we have decided to investigate the 
capabilities of a pure (qualified) Dublin Core approach. An additional advantage of 
this approach is that existing Dublin Core-based search engines can be used with very 
little modification. 

Below we show how Dublin Core can be applied to index the documentary described 
above. For the search and retrieval requirements of most users of the audiovisual 
collection, a two-level metadata structure is sufficient. Complete video documents 
sit at the top level, and each video document contains a number of consecutive 
scenes, which sit at the second level. Below are the descriptions for the complete 
video and the first scene. The descriptions for the other scenes use the same 
elements as scene 1, with different content. 

Most of the metadata for the complete video document can be retrieved directly 
from the existing MARC records by mapping MARC tags to Dublin Core elements. 
Some fields require simple reformatting or pre-processing. 
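
As an illustration of such a mapping, the sketch below pairs the MARC tags of Table 1 with the Dublin Core elements used in this section. The pairing is read off the example record; the function name, the title clean-up and the year extraction are assumptions about what "simple reformatting" might involve, not a description of the actual tool.

```python
import re

# Tag-to-element pairs inferred from Table 1 and the example description.
MARC_TO_DC = {
    "245": "Title",
    "522": "Creator",
    "650": "Subject",
    "545": "Description",
    "260": "Publisher",
    "300": "Format",
    "082": "Identifier",
    "514": "Language",
    "651": "Coverage",
}


def marc_to_dublin_core(record):
    """Map a {MARC tag: value} record to a {DC element: value} dictionary."""
    dc = {}
    for tag, value in record.items():
        element = MARC_TO_DC.get(tag)
        if element is not None:
            dc[element] = value

    # Assumed pre-processing: drop the "(videorecording)" medium designator.
    if "Title" in dc:
        dc["Title"] = re.sub(r"\s*\(videorecording\)\s*$", "", dc["Title"])

    # Assumed pre-processing: take the Date from the year in tag 260.
    year = re.search(r"\b(1[89]\d{2}|20\d{2})\b", record.get("260", ""))
    if year:
        dc["Date"] = year.group(1)
    return dc


example = marc_to_dublin_core({
    "082": "305.89921",
    "245": "The Sex Warriors and the Samurai (videorecording)",
    "260": "London: Formation Films for Channel Four, 1995",
})
```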

The metadata generator application described in Section 4 enables the descriptions 
to be entered into forms displayed alongside the video replay window. 

• Complete Video Documentary 

Title = "The Sex Warriors and the Samurai" 
Creator = "Producer, Parminder Vir; Written and directed by Nick Deocampo." 
Subject = "Female impersonators - Philippines - Manila" 
Description = "Documentary about Jo-an who works in bars in Manila performing 
a drag act, and as a prostitute to support himself and his impoverished family. He 
works to get a work visa to enable him to go to Japan where he can earn money." 
Publisher = London: Formation Films for Channel Four, 1995 
Date = 1995 
Type = "Image.Moving.Film.Documentary" 
Format = 1 videocassette (27 min.) : sd., col. ; 1/2 in 
Identifier = 305.89921 
Source = QVC 305.89921 sex vhs 
Language = In Tagalog with English subtitles 
Relation.HasPart = scene1, scene2, scene3, scene4, scene5, ... 
Coverage = Manila (Philippines) - Social conditions 

• Scene 1 

Description.transcript = "Jo-an has worked for 10 years in bars like this in Manila. 
Now with the Government's crack-down on the flesh industry, Jo-an finds it difficult 
to support a family. The yen is luring Filipinos away and Jo-an has been determined 
to leave for Jin Ch Pui, Japan's entertainment capital. I wanted to find out 
what Jo-an has to go through to make the journey to the land of the samurai." 
Description.keyframe = "http://www.slq.qld.gov.au/av/scene1.gif" 
Description.clip = "http://www.slq.qld.gov.au/av/scene1.rm" 
Type = "Image.Moving.Film.documentary.scene" 
Format.length = 2 min 25 secs 
Coverage.t.min scheme=SMPTE content = 00:00:00;0 
Coverage.t.max scheme=SMPTE content = 00:02:25;25 
Relation.IsPartOf = video_doc 
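
The Format.length value follows from the two SMPTE time codes. A minimal sketch of that calculation is given below; the function names are hypothetical, the nominal rate of 30 frames per second is an assumption (the frame number 25 in the example implies a rate above 25), and drop-frame corrections are ignored.

```python
import re


def parse_smpte(timecode, fps=30):
    """Convert 'HH:MM:SS;FF' (or 'HH:MM:SS:FF') to an absolute frame count."""
    match = re.fullmatch(r"(\d+):(\d+):(\d+)[;:](\d+)", timecode.strip())
    if match is None:
        raise ValueError("not an SMPTE time code: %r" % timecode)
    hours, minutes, seconds, frames = map(int, match.groups())
    if frames >= fps:
        raise ValueError("frame number exceeds the assumed frame rate")
    return ((hours * 60 + minutes) * 60 + seconds) * fps + frames


def format_length(start, end, fps=30):
    """Return the scene length as 'M min S secs', discarding leftover frames."""
    frame_count = parse_smpte(end, fps) - parse_smpte(start, fps)
    if frame_count < 0:
        raise ValueError("end time precedes start time")
    minutes, seconds = divmod(frame_count // fps, 60)
    return "%d min %d secs" % (minutes, seconds)


scene1_length = format_length("00:00:00;0", "00:02:25;25")
```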

3.4 The RDF Schema and Description 

An RDF Schema [11] is used to define the hierarchical structure of the video 
documents and the attributes associated with each level. It is used to generate the 
form fields for inputting and editing the metadata for both the complete video 
document and each of the scenes. It is also used to constrain and validate the input 
and to define the RDF output file format. 

In the schema, the fifteen Dublin Core properties are associated with the top-level 
Video_document class. In addition, there is a contains property, whose domain is the 
Video_document class and whose range is the Scene class. The Scene class is defined 
as a sub-class of Video_document so that it inherits all of the Dublin Core properties. 
The Scene sub-class also has its own descriptive properties: duration, startTime, 
endTime, keyFrame, clip and transcript. 
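
To make the inheritance concrete, the sketch below derives the list of form fields for each class from a small in-memory description of the schema. It is an illustration only: the dictionary layout and the function name are assumptions, and the real application reads an RDF schema file rather than a Python structure.

```python
DUBLIN_CORE = [
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
]

# Hypothetical in-memory form of the schema described in the text.
SCHEMA = {
    "Video_document": {
        "subclass_of": None,
        "properties": DUBLIN_CORE + ["contains"],
    },
    "Scene": {
        "subclass_of": "Video_document",
        "properties": ["duration", "startTime", "endTime",
                       "keyFrame", "clip", "transcript"],
    },
}


def form_fields(class_name, schema=SCHEMA):
    """List a class's properties, inherited ones first.

    In this simple model a sub-class inherits every property of its
    parent, so Scene also receives 'contains'.
    """
    entry = schema[class_name]
    parent = entry["subclass_of"]
    inherited = form_fields(parent, schema) if parent else []
    own = [p for p in entry["properties"] if p not in inherited]
    return inherited + own


scene_fields = form_fields("Scene")
```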

Figure 1 below illustrates the RDF data model for the video documents in this 
application. The complete RDF schema is shown in the Appendix. 

[Figure: the Scene class is a subclass of Video_document and carries the properties 
startTime, endTime, duration, keyFrame, clip and transcript.] 

Fig. 1. RDF Data Model for Video_document 



4 The "Veggie" Video Metadata Generator 

The original idea behind the Video Metadata Generator was to extend DSTC’s Reggie 
application [12], a metadata generator and editor for textual documents, to video 
documents. Two key differences between the Reggie application and the "Veggie" 
application had a major impact on design considerations: 

1. The need for an integrated video replay window. This enables the user to 
simultaneously view the video and edit the metadata descriptions, and to segment and 
index the video via VCR controls and links between the video window and the 
metadata fields. 

2. The hierarchical, layered nature of video documents, which demands different 
metadata forms for each layer. For example, the top-level video document has 
different metadata fields from the lower-level scene segments. 



4.1 The System Architecture and User Interface 

Figure 2 below illustrates the overall system architecture. A number of recent 
acquisitions from the SLQ Audiovisual Unit's video collection were digitized using a 
Broadway video capture card. Both MPEG-1 and RealMedia formats were captured. 
Metadata for a particular video is created by running the Veggie (Java) application 
and opening the corresponding MPEG-1 video file. This displays the video file in a 
video window and reads an RDF schema file to generate a form containing fields for 
the metadata entry. 




Fig. 2. System Architecture 

The metadata for the complete video document is entered first. Alternatively, this 
can be read from the existing catalogue by mapping the US MARC records into the 
Dublin Core fields. The user interface is shown in a screen capture in Figure 3 below. 

After the metadata for the complete video document is entered, the Veggie 
application enables users to segment the video into "scenes" using the VCR controls 
and a timeline on the video window and to enter the metadata for each scene. 

A scene-level metadata form (Figure 4) is generated from the RDF schema file. The 
temporal (startTime, endTime, duration) and keyFrame values can be specified in the 
VCR window and are inserted automatically into the relevant metadata fields via links 
between the form fields and the VCR window. Users can also enter the transcript for 
the current scene. 

Fig. 3. Top-level Video Metadata Entry for Complete Video 

Fig. 4. Scene-level Metadata Entry Interface 

Once the metadata entry for the currently loaded video is completed, users have a 
choice of saving it as RDF or HTML 4.0. Examples of output are shown in Section 4.2 
below. 

4.2 The Output 

Currently users can save their metadata descriptions as RDF or HTML 4.0. Ideally, 
the system could also be configured so that Veggie is integrated directly with HotMeta 
and the metadata is saved directly into the HotMeta repository. Alternatively, the SLQ 
may want to save the metadata into their own database and generate the HTML 
summaries dynamically from this database using CGI or Java scripts. 

4.2.1 RDF Output 

Below is the RDF metadata description for the SLQ video "The Sex Warriors and the 
Samurai" described above. The description has been generated by the video metadata 
generator based on user input and the RDF schema described above. Only the 
metadata for the first scene is shown. The metadata for the other eight scenes is 
similar and can be deduced from this. 

<?xml version="1.0"?>
<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:dc="http://purl.org/dc/elements/1.0/"
     xmlns:dcq="http://purl.org/dc/qualifiers/1.0/"
     xmlns:videoschema="http://www.dstc.edu.au/video-schema#">

  <Description about="http://www.../sex_warriors.mpg">
    <rdf:type resource="#Video"/>
    <dc:Title>The Sex Warriors and the Samurai</dc:Title>
    <dc:Creator>Producer, Parminder Vir; Written and Directed by Nick
      Deocampo.</dc:Creator>
    <dc:Subject>Female impersonators - Philippines - Manila</dc:Subject>
    <dc:Description>Documentary about Jo-an who works in bars in Manila
      performing a drag act, and as a prostitute to support himself and his
      impoverished family. He works to get a work visa to enable him to go to
      Japan where he can earn more money.</dc:Description>
    <dc:Publisher>London: Formation Films for Channel Four</dc:Publisher>
    <dc:Date>1995</dc:Date>
    <dc:Type>Image.Moving.Documentary</dc:Type>
    <dc:Format>1 videocassette (27 min.) : sd., col. ; 1/2 in</dc:Format>
    <dc:Identifier>QVC 305.89921 sex vhs</dc:Identifier>
    <dc:Language>In Tagalog with English subtitles</dc:Language>
    <dc:Coverage>Manila (Philippines) - Social conditions</dc:Coverage>
    <videoschema:contains>
      <Seq>
        <LI ID="scene1" Resource="http://www.../sex_warriors.mpg#scene1"/>
        <LI ID="scene2" Resource="http://www.../sex_warriors.mpg#scene2"/>
        <LI ID="scene3" Resource="http://www.../sex_warriors.mpg#scene3"/>
        <LI ID="scene4" Resource="http://www.../sex_warriors.mpg#scene4"/>
        <LI ID="scene5" Resource="http://www.../sex_warriors.mpg#scene5"/>
        <LI ID="scene6" Resource="http://www.../sex_warriors.mpg#scene6"/>
        <LI ID="scene7" Resource="http://www.../sex_warriors.mpg#scene7"/>
        <LI ID="scene8" Resource="http://www.../sex_warriors.mpg#scene8"/>
        <LI ID="scene9" Resource="http://www.../sex_warriors.mpg#scene9"/>
      </Seq>
    </videoschema:contains>
  </Description>

  <Description about="http://www.../sex_warriors.mpg#scene1">
    <rdf:type resource="#Scene"/>
    <dc:Type>Image.Moving.Documentary.Scene</dc:Type>
    <startTime>00:00:00;0</startTime>
    <endTime>00:02:25;25</endTime>
    <duration>2 mins 25 secs</duration>
    <keyFrame>http://www.../sex_warriors1.gif</keyFrame>
    <clip>http://www.../sex_warriors.rm</clip>
    <transcript>Jo-an has worked for 10 years in bars like this in Manila.
      Now with the Government's crack-down on the flesh industry, Jo-an finds
      it difficult to support a family. The yen is luring Filipinos away and
      Jo-an has been determined to leave for Jin Ch Pui, Japan's entertainment
      capital. I wanted to find out what Jo-an has to go through to make the
      journey to the land of the samurai.</transcript>
  </Description>

  <!-- similar descriptions for the other scenes -->

</RDF>
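
A description of this shape can be read back with any namespace-aware XML parser. The sketch below uses Python's standard xml.etree.ElementTree on a cut-down sample; the function name is hypothetical, and the example.org host is a placeholder standing in for the elided host name in the listing above.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Cut-down sample; example.org is a placeholder host, not the real location.
SAMPLE = """<?xml version="1.0"?>
<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:dc="http://purl.org/dc/elements/1.0/">
  <Description about="http://example.org/sex_warriors.mpg">
    <dc:Title>The Sex Warriors and the Samurai</dc:Title>
    <dc:Date>1995</dc:Date>
  </Description>
  <Description about="http://example.org/sex_warriors.mpg#scene1">
    <startTime>00:00:00;0</startTime>
    <endTime>00:02:25;25</endTime>
  </Description>
</RDF>"""


def read_descriptions(rdf_text):
    """Return {resource URI: {property local name: text}} for each Description."""
    root = ET.fromstring(rdf_text)
    descriptions = {}
    for description in root.findall("{%s}Description" % RDF_NS):
        properties = {}
        for child in description:
            # ElementTree tags look like '{namespace}local'; keep the local part.
            local_name = child.tag.split("}", 1)[-1]
            properties[local_name] = (child.text or "").strip()
        descriptions[description.get("about")] = properties
    return descriptions


parsed = read_descriptions(SAMPLE)
```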



4.2.2 HTML Output 

If the metadata is saved as HTML, a web page is created automatically which 
presents a visual summary of the video. The HTML output for the example video 
"The Sex Warriors and the Samurai" can be found at [13]. The metadata for the overall 
video document is saved as META tags in the header of the HTML file, as shown below. 
This complies with the Internet Draft by John Kunze [14], which specifies how Dublin 
Core can be encoded in HTML. The keyframes for each scene are laid out sequentially, 
and below each keyframe are the corresponding time stamps and transcript. 
Clicking on a keyframe causes the associated RealMedia clip to be played. 

<META NAME="DC.Title" CONTENT="The Sex Warriors and the Samurai">
<META NAME="DC.Subject" CONTENT="Female impersonators - Philippines - Manila">
<META NAME="DC.Creator" CONTENT="Producer, Parminder Vir; Written and directed
  by Nick Deocampo">
<META NAME="DC.Description" CONTENT="Documentary about Jo-an who works in bars
  in Manila performing a drag act, and as a prostitute to support himself and
  his impoverished family. He works to get a work visa to enable him to go to
  Japan where he can earn money.">
<META NAME="DC.Publisher" CONTENT="London: Formation Films for Channel Four, 1995">
<META NAME="DC.Date" CONTENT="1995">
<META NAME="DC.Type" CONTENT="Image.Moving.Film.Documentary">
<META NAME="DC.Format" CONTENT="1 videocassette (27 min.) : sd., col. ; 1/2 in">
<META NAME="DC.Identifier" CONTENT="305.89921">
<META NAME="DC.Source" CONTENT="QVC 305.89921 sex vhs">
<META NAME="DC.Language" CONTENT="In Tagalog with English subtitles">
<META NAME="DC.Coverage" CONTENT="Manila (Philippines) - Social conditions">
<META NAME="DC.Rights" CONTENT="Copyright Formation Films Ltd 1995">
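
Tags of this form are straightforward to generate from a dictionary of Dublin Core values. The following sketch is an illustration rather than the code used in Veggie; the function name is an assumption, and attribute values are escaped so that quotes or ampersands in a transcript or title cannot break the markup.

```python
from html import escape


def dublin_core_meta_tags(metadata):
    """Render {DC element: value} pairs as HTML META tags, one per line."""
    lines = []
    for element, value in metadata.items():
        lines.append('<META NAME="DC.{}" CONTENT="{}">'.format(
            escape(element, quote=True), escape(value, quote=True)))
    return "\n".join(lines)


header = dublin_core_meta_tags({
    "Title": "The Sex Warriors and the Samurai",
    "Date": "1995",
})
```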



4.3 The Search Engine 

DSTC’s existing HotMeta Search Engine [5] crawls over specified sites, extracts and 
indexes metadata from embedded HTML Meta tags and saves it in a metadata reposi- 
tory. 

If the output from Veggie is saved to HTML 4.0 and added to a site traversed by 
HotMeta, then the metadata will be added to the HotMeta repository. Subsequently, 
any searches using HotMeta which match the video metadata will retrieve the video 
summary page. 
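
The extraction step can be pictured with a short sketch based on Python's standard html.parser module. This is not HotMeta's implementation, whose internals are not described here; the class and function names are assumptions, and a repeated element simply overwrites the earlier value.

```python
from html.parser import HTMLParser


class DublinCoreHarvester(HTMLParser):
    """Collect the content of META tags whose name starts with 'DC.'."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.upper().startswith("DC."):
            self.metadata[name[3:]] = attributes.get("content") or ""


def harvest(html_text):
    """Return {DC element: value} for the Dublin Core META tags in a page."""
    parser = DublinCoreHarvester()
    parser.feed(html_text)
    parser.close()
    return parser.metadata


page = ('<html><head>'
        '<META NAME="DC.Title" CONTENT="The Sex Warriors and the Samurai">'
        '<META NAME="keywords" CONTENT="ignored">'
        '</head><body></body></html>')
harvested = harvest(page)
```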

Certain extensions could be made to both Veggie and HotMeta to improve the inte- 
gration between them. It is possible to insert the video metadata descriptions directly 
into the HotMeta repository and to retrieve links to the actual video clips rather than 
the summary pages. 

We would also like to investigate replacing the existing GIF keyFrame images with 
PNG [15] images in which the metadata is embedded and extending HotMeta to ex- 
tract metadata from the PNG images and add it to the repository. 



5 Conclusions and Future Work 

We have developed an application which can be used by audiovisual librarians to 
quickly and easily create detailed, visually stunning summaries of their latest video 
acquisitions. At the same time, the application generates standardized metadata 
descriptions which can be used by existing Dublin Core-based Internet search engines 
to enable the resource discovery of video documents. We believe that this tool will 
enable librarians to provide third-generation, value-added services for audiovisual 
collections, such as they are currently providing for textual resources. 

Future Work includes: 

• Integrating the digitization process and the metadata input process. Currently these 
are two separate processes which ideally would be carried out within a single appli- 
cation. 

• Installing and testing the prototype within the SLQ Audiovisual Unit. So far the 
system has only been tested at DSTC using SLQ content. We would like to investigate 
the system's feasibility within a real library environment. In particular, we are 
interested in determining the average time it will take an experienced video 
cataloguer to generate the metadata for each new acquisition. 

• Determining whether the online visual summaries generate increased interest and 
usage of the audiovisual collections by monitoring the number of hits to the SLQ 
Audiovisual web site and the borrowing statistics before and after deployment of 
this system. 

• Integrating Veggie with HotMeta by enabling the metadata to be added directly into 
the HotMeta metadata repository. 

• Extending HotMeta to enable viewing of retrieved RealMedia video clips. 

• Replacing GIF and JPEG scene changes with PNG images containing embedded 
metadata. 

• Automating the search over authoritative film review sites to automatically retrieve 
related reviews, articles and links to other works by the same creators. 

Appendix: The RDF Schema 

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/TR/PR-rdf-schema#"
    xmlns:dc="http://purl.org/dc/elements/1.0/">

  <rdfs:Class ID="Video_document">
    <rdfs:comment>Class for representing a generic video document</rdfs:comment>
  </rdfs:Class>

  <rdfs:comment>Define all of the DC elements for Video_document</rdfs:comment>

  <rdf:PropertyType ID="Title">
    <rdfs:comment>This is the DC Title element</rdfs:comment>
    <rdfs:domain rdf:resource="#Video_document"/>
    <rdfs:range rdf:resource="http://purl.org/dc/elements/1.0/#Title"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="Creator">
    <rdfs:comment>This is the DC Creator element</rdfs:comment>
    <rdfs:domain rdf:resource="#Video_document"/>
    <rdfs:range rdf:resource="http://purl.org/dc/elements/1.0/#Creator"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="Subject">
    <rdfs:comment>This is the DC Subject element</rdfs:comment>
    <rdfs:domain rdf:resource="#Video_document"/>
    <rdfs:range rdf:resource="http://purl.org/dc/elements/1.0/#Subject"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="Description">
    <rdfs:comment>This is the DC Description element</rdfs:comment>
    <rdfs:domain rdf:resource="#Video_document"/>
    <rdfs:range rdf:resource="http://purl.org/dc/elements/1.0/#Description"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="Publisher">
    <rdfs:comment>This is the DC Publisher element</rdfs:comment>
    <rdfs:domain rdf:resource="#Video_document"/>
    <rdfs:range rdf:resource="http://purl.org/dc/elements/1.0/#Publisher"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="Contributor">
    <rdfs:comment>This is the DC Contributor element</rdfs:comment>
    <rdfs:domain rdf:resource="#Video_document"/>
    <rdfs:range rdf:resource="http://purl.org/dc/elements/1.0/#Contributor"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="Date">
    <rdfs:comment>This is the DC Date element</rdfs:comment>
    <rdfs:domain rdf:resource="#Video_document"/>
    <rdfs:range rdf:resource="http://purl.org/dc/elements/1.0/#Date"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="Type">
    <rdfs:comment>This is the DC Type element</rdfs:comment>
    <rdfs:domain rdf:resource="#Video_document"/>
    <rdfs:range rdf:resource="http://purl.org/dc/elements/1.0/#Type"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="Format">
    <rdfs:comment>This is the DC Format element</rdfs:comment>
    <rdfs:domain rdf:resource="#Video_document"/>
    <rdfs:range rdf:resource="http://purl.org/dc/elements/1.0/#Format"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="Identifier">
    <rdfs:comment>This is the DC Identifier element</rdfs:comment>
    <rdfs:domain rdf:resource="#Video_document"/>
    <rdfs:range rdf:resource="http://purl.org/dc/elements/1.0/#Identifier"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="Source">
    <rdfs:comment>This is the DC Source element</rdfs:comment>
    <rdfs:domain rdf:resource="#Video_document"/>
    <rdfs:range rdf:resource="http://purl.org/dc/elements/1.0/#Source"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="Language">
    <rdfs:comment>This is the DC Language element</rdfs:comment>
    <rdfs:domain rdf:resource="#Video_document"/>
    <rdfs:range rdf:resource="http://purl.org/dc/elements/1.0/#Language"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="Relation">
    <rdfs:comment>This is the DC Relation element</rdfs:comment>
    <rdfs:domain rdf:resource="#Video_document"/>
    <rdfs:range rdf:resource="http://purl.org/dc/elements/1.0/#Relation"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="Coverage">
    <rdfs:comment>This is the DC Coverage element</rdfs:comment>
    <rdfs:domain rdf:resource="#Video_document"/>
    <rdfs:range rdf:resource="http://purl.org/dc/elements/1.0/#Coverage"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="Rights">
    <rdfs:comment>This is the DC Rights element</rdfs:comment>
    <rdfs:domain rdf:resource="#Video_document"/>
    <rdfs:range rdf:resource="http://purl.org/dc/elements/1.0/#Rights"/>
  </rdf:PropertyType>

  <rdfs:comment>Define the Scene class and its properties</rdfs:comment>

  <rdfs:Class ID="Scene">
    <rdfs:comment>Class for representing a scene from a video
      document. It is a subclass of Video_document</rdfs:comment>
    <rdfs:subClassOf rdf:resource="#Video_document"/>
  </rdfs:Class>

  <rdf:PropertyType ID="contains">
    <rdfs:comment>Property related to a video asset stating that a
      video consists of a number of sequences.</rdfs:comment>
    <rdfs:domain rdf:resource="#Video_document"/>
    <rdfs:range rdf:resource="#Scene"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="duration">
    <rdfs:domain rdf:resource="#Scene"/>
    <rdfs:range rdf:resource="http://www.w3.org/TR/datatypes#Time"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="startTime">
    <rdfs:domain rdf:resource="#Scene"/>
    <rdfs:range rdf:resource="http://www.w3.org/TR/datatypes#Time"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="endTime">
    <rdfs:domain rdf:resource="#Scene"/>
    <rdfs:range rdf:resource="http://www.w3.org/TR/datatypes#Time"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="keyFrame">
    <rdfs:domain rdf:resource="#Scene"/>
    <rdfs:range rdf:resource="http://www.w3.org/TR/datatypes#Image"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="clip">
    <rdfs:domain rdf:resource="#Scene"/>
    <rdfs:range rdf:resource="http://www.w3.org/TR/datatypes#realmedia"/>
  </rdf:PropertyType>

  <rdf:PropertyType ID="transcript">
    <rdfs:domain rdf:resource="#Scene"/>
    <rdfs:range rdf:resource="http://www.w3.org/TR/WD-rdf-schema#Literal"/>
  </rdf:PropertyType>

</rdf:RDF>

Abstract. Music is an important component of digital libraries. This 
paper discusses a digital music library from the information retrieval 
viewpoint and proposes a method for extracting theme phrases. These 
are then used to present a shorter version of retrieved music to users. The 
method consists of two steps: phrase extraction and syntactical 
classification of segmented fragments of melodies. Phrase extraction is carried 
out based on a few heuristic rules. We conducted an experiment on the 
accuracy of phrase extraction using 94 Japanese popular songs and 
obtained 0.766 recall and 0.786 precision. The syntactical classification is 
based on a probabilistic syntactical pattern analysis combining 
classification and syntactical analysis. The proposed method uses a decision 
tree and a finite state automaton and obtained 0.884 accuracy in theme 
phrase extraction. 



1 Introduction 

Music is an important component of digital libraries. Traditional libraries han- 
dle musical information using analogue records and CDs, where users need to 
search for particular pieces of music using catalogues. In a digital music library, 
it is possible to search musical information by content as well as by catalogue 
information. Information retrieval is a mature research area and IR-related tech- 
nologies for textual information are well understood. A typical retrieval scheme 
is query formulation, matching of query with objects in the database, and pre- 
sentation of results. These retrieval steps are repeated until the target objects are 
obtained. This retrieval scheme can be applied to musical information retrieval, 
but the techniques applied in each step must be adapted to musical information 
retrieval. 

In previous work, textual metadata such as the composer's name and singer's 
name was used for queries in musical information retrieval. Researchers are 
currently interested in content-based retrieval. For example, Ghias used 
humming as the query mechanism for music retrieval from an audio database [4]. 
Kaizuka proposed a retrieval system based on singing songs [6]. In this system, 
both melodies and lyrics are used for query formulation. MIDI (Musical 
Instrument Digital Interface) instruments are another method for formulating queries. 
In these approaches, i.e., query by humming, singing or performing, queries are 
formulated by specifying a part of the music piece. Another approach is to use 
emotive words such as 'beautiful' and 'clear' (e.g., [10] [20]). Each piece of music 
is assigned predefined emotive words, and the query also consists of emotive 
words. 

The matching process varies depending on the query formulation. With tex- 
tual metadata and emotive words, textual information retrieval methods can be 
applied. When using parts of musical pieces as in queries by humming, singing or 
performing, we need to define similarities among melodies. In these approaches, 
melodies are represented using note sequences, and approximate matching is 
applied to find pieces of music in the database similar to the query. Ghias [4] 
and Bakhmutova [1] reported the application of approximate matching to mu- 
sic retrieval. Another approach is to construct a feature space and map both 
musical pieces and queries to points in the feature space. In this approach, the 
similarity is defined as the Euclidean distance or cosine measure between points 
in the feature space. Tsuji et al. [17] proposed a feature space consisting of tri- 
gram patterns in melodies. In this method, nine kinds of trigram pitch patterns 
were used as a feature and the feature value was defined as the frequency of 
appearance of the pattern in a musical piece. 
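
The feature-space approach can be illustrated with a small sketch. The exact nine patterns of Tsuji et al. [17] are not given here, so the code assumes that each trigram of three consecutive notes is classified by the direction (down, same, up) of its two pitch intervals, which yields nine patterns; the function names and the MIDI note numbers are likewise illustrative assumptions.

```python
import math
from collections import Counter
from itertools import product

DIRECTIONS = ("down", "same", "up")
# Assumed pattern set: the direction of each of the two intervals in a trigram.
PATTERNS = list(product(DIRECTIONS, repeat=2))


def direction(first, second):
    """Classify the pitch movement from one note to the next."""
    if second > first:
        return "up"
    if second < first:
        return "down"
    return "same"


def trigram_features(pitches):
    """Count how often each of the nine patterns occurs in a pitch sequence."""
    counts = Counter(
        (direction(a, b), direction(b, c))
        for a, b, c in zip(pitches, pitches[1:], pitches[2:])
    )
    return [counts[pattern] for pattern in PATTERNS]


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))
    return dot / norm if norm else 0.0


# MIDI note numbers for a short melody and the same melody transposed.
melody = trigram_features([60, 62, 64, 62, 60])
transposed = trigram_features([65, 67, 69, 67, 65])
similarity = cosine_similarity(melody, transposed)
```

Because only the direction of movement is counted, a transposed query maps to the same point in the feature space as the original melody.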

Retrieval results are usually presented to users in two ways: ranking the 
candidate objects and visualizing the results by plotting candidate objects in a 
low dimensional space [10] according to the similarity between the query and the 
pieces of music. Both methods are the same as those used in textual information 
retrieval except for the definition and calculation of similarity. When judging 
whether the candidate object is the required one or not, users need to listen to 
each piece of music. In this step, we need to skim through the piece of music as 
in video retrieval. Several skimming methods have been developed for texts and 
videos (e.g., [16]), but no skimming method has been proposed for music, as far 
as we know. 

We have been developing a musical retrieval system [19]. In this system, we 
plan to use a theme phrase as a representative fragment of a piece of music for 
presenting retrieval results. This paper proposes a method for extracting theme 
phrases from pieces of music. In the following sections, we first show our musical 
information system in Section 2, then explain our phrase extraction method in 
Section 3. In Section 4, we present a method for extracting theme phrases, and 
show experimental results in Section 5 where the presented method is applied 
to the analysis of Japanese popular music. Conclusions are given in Section 6. 



2 Background and Outline of a Musical Information 
Retrieval System 

There are two representational levels in musical information. One is the 
physical level, in which musical information is represented by sound waveforms that, 
when transmitted to humans, are perceived as music. Many technologies at this 
level have been developed previously. Sampling technology enables digitizing of 
musical information so that it can be handled by computers. Filters using 
signal processing technologies are another example related to this aspect. See the 
related chapters of the book [14] for comprehensive explanations. 

Another approach is at the logical level. This representational level is deeply 
related to the cognitive aspect of humans. It is believed that musical information 
contains various structures such as tonal structures, melodies and rhythms, and 
human beings discriminate music from other sounds based on these structural 
regularities. The logical representation must be able to represent the structures 
present in a piece of music. Several aspects of the structure of music have been 
studied in musicology. The rhythmical aspect is concerned with beats and mea- 
sures as basic components of the metrical structure. On the other hand, the 
melodic aspect is concerned with harmonic and tonal structures constructed 
from chords and chord progressions. 

There are two well-known theories of the general structure of music: A 
Generative Theory of Tonal Music (GTTM) [8] and Narmour's theory [12]. These 
theories partly reveal the types and characteristics of music structures. A 
musical score is one of the standard representations for the various structures in 
musical information. 

A digital music library needs to handle both representational levels as an 
information circulation system where musical information is acquired, stored, 
retrieved and presented. Musical information has long been circulated as sound 
waveforms on the media of analog records and compact disks. Therefore, the 
digital music library needs to acquire musical information represented as sound 
waveforms. Recent advances in music synthesis technology, such as synthesizers
and sequencers, are changing the music production process. Many popular songs
are produced using sequencers, in which musical information is represented at 
the logical level. Therefore, musical information represented at the logical level 
such as MIDI data is an alternative source for digital music libraries. In the 
retrieval phase, the music should be represented at the logical level. Although 
more research is required to know what kinds of structures are useful for the 
retrieval phase, music structure will be helpful for content-based musical
information retrieval, similar to text retrieval. Finally, we need to use
physical-level representations at the presentation phase.

A problem arises due to the requirement of handling two levels of musical 
information, namely the transformation of the data between two levels. This 
situation is very similar to digital document libraries, in which retrospective 
documents are obtained as document images from printed books, journals and 
magazines, while new documents are produced as logically tagged documents 
using technologies such as SGML and XML. Digital document libraries need to 
handle varieties of document formats. This leads to research such as information 
extraction and document image analysis to extract structures from various doc- 
uments. In order to utilize musical information represented with various formats, 
we also need methods for extracting music structures. There are many studies 
on extracting information from musical scores. For example, beat tracking is one
research area of computer music and several methods have been proposed to recognize
the beat structure of music (e.g., [5]). Tonal analysis is another example.
Rowe [15] discusses extraction methods for various structures.

Fig. 1. Musical Retrieval System

We have been developing a song retrieval system. The goals of this system are:

— to identify the structures that are useful for musical information retrieval, 

— to develop a method for extracting these structures automatically, and 

— to develop an indexing method for fast retrieval from large music databases. 

In this system, we use musical phrases as fundamental structural elements. A 
phrase is a natural concept representing a musically meaningful component. We 
believe a phrase-based system has two advantages. Firstly, it improves the ac- 
curacy of matching in musical retrieval. A user generally gives a query as a 
phrase or a few phrases. By segmenting songs in the database into phrases, we 
can handle local features of songs, and consequently, achieve high accuracy in 
matching a query and a song. Secondly, we can use phrases as concise represen- 
tations for musical retrieval. This is the main subject of this paper. This system 
uses melody for extracting phrases and features. The investigations described 
below are concerned only with melody, although we plan to incorporate the use 
of rhythm and harmony in later work. 

Figure 1 shows an outline of the system. We store songs into a database 
through the following steps: 

1. segmenting into phrases, 

2. generating a feature vector for each phrase, and 

3. adding the feature vectors to an index. 

On the other hand, a query is transformed into a feature vector in the same
way as songs. For the given feature vector of the query, the system finds a set 
of songs whose feature vectors are close to the query vector, and a ranked list of 
songs is presented to the user. The user selects songs from the listed result and 
listens to the theme phrases of the music until the target one is found. 

In this system, we use MIDI for representing musical information. Its repre- 
sentational level almost corresponds to a score, and it is suitable for handling 
structures of music. In the current version of the system, queries are specified 
by performing part of the target song using a MIDI instrument. In order to use 
humming as the query medium, we must incorporate a transformation of hum- 
ming to MIDI data. Similarly, the formats of songs are restricted to MIDI data. 
In order to handle songs in sound waveforms, we need to convert them to MIDI 
data. 

In this system, similarity among songs and queries is measured based on 
a melody contour that is represented with a note sequence for each phrase or 
query. Feature vectors are assigned to each phrase or query based on the patterns 
represented by n-grams of note sequences. Each pattern $p_i$ corresponds to an axis
$a_i$ of the feature space, and the value on axis $a_i$ for a phrase or query $q$ is the
frequency with which pattern $p_i$ appears in $q$. The similarity between two vectors
is then defined as the cosine measure of the vectors. We experimentally confirmed
that we can improve the accuracy of matching by using feature vectors extracted
from phrases. In experiments using 100 Japanese popular songs,
the accuracy improved by about 10% to 20%. Refer to
the paper [18] for a detailed discussion of phrase-based matching.
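
The n-gram feature vectors and the cosine measure described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how notes are encoded before n-grams are taken, so the sketch uses pitch intervals between consecutive notes as a simple contour encoding, and the function names `ngram_counts` and `cosine` are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_counts(pitches, n=3):
    """Count n-grams over pitch intervals (an assumed contour encoding)."""
    # Intervals between consecutive notes make the patterns independent of key.
    intervals = [b - a for a, b in zip(pitches, pitches[1:])]
    grams = (tuple(intervals[i:i + n]) for i in range(len(intervals) - n + 1))
    return Counter(grams)


def cosine(u, v):
    """Cosine measure between two sparse frequency vectors stored as Counters."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


phrase = [60, 62, 64, 65, 67, 65, 64, 62]
query = [62, 64, 66, 67, 69]  # same opening contour, transposed up a tone
print(cosine(ngram_counts(phrase, 2), ngram_counts(query, 2)))
```

Each distinct n-gram plays the role of one axis of the feature space, and the stored count is the value on that axis.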

In order to realize fast searches for musical information retrieval, we need to 
make an index. A multidimensional indexing technique can be applied to find 
vectors close to the query vector. Several data structures and algorithms have 
been developed to create an index and to search for vectors that are close to the 
query vector [3].

3 Phrase Extraction from MIDI Data 

MIDI is designed for interconnecting instruments and computers and for
transmitting musical information. MIDI data consists of various kinds of
messages, each with a set of attributes, to control instruments. For example, a note-on
message contains a note number and a note-on velocity and means that a key
is pressed on a MIDI instrument. The note number specifies a key of the MIDI
instrument and stands for the pitch of the note, while the note-on velocity indicates
how hard the key is pressed. A note-off message means that a key is released.
The MIDI messages are sequentially located on the time axis. In our system, 
we extract phrases using melodies that are obtained from MIDI data by tracing 
the note-on and note-off messages. Pairing of a note-on message and a note- 
off message corresponds to a note. We extract a sequence of notes that have 
attribute values of pitch, velocity and length from MIDI data. For a detailed 
description of MIDI, please refer to a document such as chapter 21 in [14]. 
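
The pairing of note-on and note-off messages into notes can be sketched as below. The event format is an assumption made for illustration (plain tuples rather than the output of any particular MIDI library), and the names `Note` and `pair_notes` are hypothetical.

```python
from typing import NamedTuple


class Note(NamedTuple):
    pitch: int     # MIDI note number
    velocity: int  # note-on velocity
    onset: int     # start time in ticks
    length: int    # duration in ticks


def pair_notes(events):
    """Pair note-on and note-off events into notes.

    `events` is an assumed, simplified format: (time, kind, note_number,
    velocity) tuples sorted by time, where kind is "on" or "off".
    """
    pending = {}  # note number -> (onset time, velocity) of the sounding key
    notes = []
    for time, kind, number, velocity in events:
        # A note-on with velocity 0 is conventionally treated as a note-off.
        if kind == "on" and velocity > 0:
            pending[number] = (time, velocity)
        elif number in pending:
            onset, on_velocity = pending.pop(number)
            notes.append(Note(number, on_velocity, onset, time - onset))
    return sorted(notes, key=lambda note: note.onset)
```

The resulting list carries the pitch, velocity and length attributes used in the rest of this section.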




[Figure: score excerpt of a melody, with a legend marking submelodies and phrases]

Fig. 2. Structure of Melody Information



Hereafter, we denote a melody obtained from MIDI data as a note sequence
$n_1(p_1, v_1, l_1)\, n_2(p_2, v_2, l_2) \cdots n_n(p_n, v_n, l_n)$, where $p_i$, $v_i$ and $l_i$, respectively, stand
for the pitch, velocity and length of the $i$-th note.

There are several rules for identifying phrases. A cadence, derived from har- 
monic structure, is a pattern indicating the end of a phrase. There are several 
cadence patterns, such as perfect cadence, half cadence and plagal cadence. 
GTTM proposed prolongational reduction for analyzing phrase structure, in 
which a phrase consists of a tensing part and a relaxing part in this order. 
Metrical structure gives further information for phrase extraction. A bar line 
often corresponds to a boundary of a phrase. Changes of tempo and rhythm also 
give information about phrase boundaries. A rest is also a candidate for a phrase
boundary. However, we cannot apply these rules in all cases. For example, figure
2 shows part of a song that contains two phrases. As shown in the figure, both
phrases start inside a measure and end within a measure. Therefore, the bar line
is not a phrase boundary in this case.

We use rests as delimiters of phrases. A phrase usually contains multiple 
rests, so we merge the oversegmented fragments using a set of heuristic rules 
described later. A fragment segmented by rests is referred to as a submelody. For 
example, both phrases in figure 2 contain three submelodies. Since MIDI data 
does not contain rest information explicitly, we extract submelodies by scanning 
a note sequence in the melody track of MIDI, and putting a rest where the 
duration between consecutive notes is greater than a fixed period of time. In 
this step, a note sequence of music is decomposed into a submelody sequence 
$s_1 s_2 \cdots s_n$. Submelodies are merged using the following characteristics:

— rule1: phrases are often repeated in a song, and

— rule2: repeated phrases have similar melody contours. 
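
The rest-based decomposition into submelodies described above can be sketched as follows. The tuple layout of a note and the parameter name `min_rest` are assumptions; the text only states that a rest is placed where the gap between consecutive notes exceeds a fixed period of time.

```python
def split_into_submelodies(notes, min_rest):
    """Split a note sequence at silences longer than `min_rest`.

    `notes` is a list of (pitch, onset, length) tuples sorted by onset;
    times are in any consistent unit, for example MIDI ticks.
    """
    submelodies = []
    current = []
    previous_end = None
    for pitch, onset, length in notes:
        # The gap between the previous note's end and this onset acts as a rest.
        if current and onset - previous_end > min_rest:
            submelodies.append(current)
            current = []
        current.append((pitch, onset, length))
        previous_end = onset + length
    if current:
        submelodies.append(current)
    return submelodies
```

A small `min_rest` oversegments the melody, which is acceptable here because the fragments are merged again in the following steps.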

In order to apply these rules to phrase extraction, we need to find phrases with 
similar melody contours. For this purpose we first find the submelodies that are 
prefixes of phrases. These submelodies should have similar submelodies in the 
song, by rule one. In order to find similar submelodies, a similarity measure based 
on the longest common subsequence [7] is used. Let $P_i$ and $P_j$ be the pitch sequences
of submelodies $s_i$ and $s_j$, respectively, and let $LCS(P_i, P_j)$ denote the longest
common subsequence of $P_i$ and $P_j$. Then, the similarity of two submelodies is
defined as

$$Sim(s_i, s_j) = \frac{|LCS(P_i, P_j)|}{\max\{|P_i|, |P_j|\}}$$


The decomposed submelodies are grouped into a set of groups $G_1, G_2, \ldots, G_k$
such that $Sim(s_i, s_j)$ of submelodies $s_i$ and $s_j$ in the same group is greater
than a threshold. Since the similarity distribution varies from song to song, we need
a procedure for adaptive thresholding. The similarity of pitch patterns of
submelodies tends to have a bimodal distribution. Figure 3 shows an example of a
histogram of similarities between all pairs of submelodies in a song. We assume
the similarity distribution is a mixture of two normal distributions

$$\lambda N(\mu_1, \sigma_1^2) + (1 - \lambda) N(\mu_2, \sigma_2^2) \qquad (1)$$

where $N(\mu, \sigma^2)$ stands for a normal distribution with mean $\mu$ and variance $\sigma^2$.
The parameters $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, and $\lambda$ are obtained by maximum likelihood
estimation using the similarities of each pair of submelodies in a song. The
similarity at which the estimated mixture distribution takes its minimal value is used as the
threshold for submelody grouping.
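
The adaptive thresholding can be sketched as below. The text does not name the estimation algorithm, so this sketch assumes a plain EM iteration for the two-component mixture (1), and it assumes that the minimum of the mixture is searched between the two estimated means. All names, the initialisation and the variance floor `min_var` are hypothetical choices.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(xs, iterations=200, min_var=1e-4):
    """EM estimate of (lam, mu1, var1, mu2, var2); needs at least two values."""
    xs = sorted(xs)
    half = len(xs) // 2
    # Initialise the means from the lower and upper halves of the sorted data.
    mu1 = sum(xs[:half]) / half
    mu2 = sum(xs[half:]) / (len(xs) - half)
    centre = (mu1 + mu2) / 2
    var1 = var2 = max(min_var, sum((x - centre) ** 2 for x in xs) / len(xs))
    lam = 0.5
    for _ in range(iterations):
        # E-step: responsibility of the first component for every sample.
        resp = []
        for x in xs:
            a = lam * normal_pdf(x, mu1, var1)
            b = (1 - lam) * normal_pdf(x, mu2, var2)
            resp.append(a / (a + b) if a + b > 0 else 0.5)
        n1 = sum(resp)
        n2 = len(xs) - n1
        if n1 < 1e-9 or n2 < 1e-9:
            break  # one component vanished; keep the last estimate
        # M-step: re-estimate mixing weight, means and variances.
        lam = n1 / len(xs)
        mu1 = sum(r * x for r, x in zip(resp, xs)) / n1
        mu2 = sum((1 - r) * x for r, x in zip(resp, xs)) / n2
        var1 = max(min_var, sum(r * (x - mu1) ** 2 for r, x in zip(resp, xs)) / n1)
        var2 = max(min_var, sum((1 - r) * (x - mu2) ** 2 for r, x in zip(resp, xs)) / n2)
    return lam, mu1, var1, mu2, var2


def valley_threshold(params, steps=1000):
    """Point between the two means where the mixture density is smallest."""
    lam, mu1, var1, mu2, var2 = params
    low, high = sorted((mu1, mu2))

    def density(x):
        return lam * normal_pdf(x, mu1, var1) + (1 - lam) * normal_pdf(x, mu2, var2)

    grid = [low + (high - low) * k / steps for k in range(steps + 1)]
    return min(grid, key=density)
```

Pairs of submelodies whose similarity exceeds the returned value would then be treated as belonging to the same group.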

Submelodies of each group with a size of more than one are candidates for the 
prefix of a phrase. As the prefix of phrases we use submelodies in these groups 
that contain: 



— the first submelody in the song 

— the submelody that follows a rest whose length is greater than or equal to 
four beats. 

Phrases are obtained by segmenting songs at the prefix submelodies. 



4 Structural Analysis and Theme Extraction 

The features of phrases may be categorized as those on individual phrases and 
those on sequences of phrases. From the computer processing point of view, the 
former characteristics correspond to the classification problem and the latter 
to the syntactical analysis problem. Therefore, we apply an analysis method 
combining the classification and the syntactical analysis to the theme phrase 
extraction problem. This method consists of the following two steps: 

— classification of phrases and 

— validation and correction of the classification result by syntactical analysis. 

The purpose of the classification step is to know the likelihood of a given object’s 
belonging to a class, while the syntactical analysis step determines the class 
of an object using both syntactical rules and the likelihood obtained at the 
classification step. 

Fig. 3. Distribution of Submelody Similarity (horizontal axis: similarity)



Classification is a mature research area and several kinds of classification rules
have been proposed, such as decision trees, nearest-neighbor rules and neural
networks [11]. We use a decision tree [13] that classifies objects represented by
a feature vector. For theme phrase extraction, we need to classify phrases into
two classes: theme phrases and non-theme phrases, denoted as $+$ and $-$ in the
following discussion. Theme phrases are expected to have such characteristics
as appearing frequently in the song and having high pitch and high strength.
From these considerations, we use four features to compose the feature vector 
of a phrase: average pitch, average length, average velocity, and frequency of 
the phrase. The first three are directly derived from the note sequence, while 
the frequency of the phrase is defined as the size of the group that the phrase 
belongs to, where the group is created using the similarities described in the 
previous section. 

For a song and a phrase in the song, let $A_p$ and $A_t$ stand for the pitch averages
within the phrase and over the song, and let $D_t$ be the standard deviation of
pitch over the song. The feature value of the pitch of the phrase is then defined
as $(A_p - A_t)/D_t$. Feature values of duration and velocity are defined in the same way.
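
A minimal sketch of such a feature value follows. It assumes the feature is the phrase average minus the song average, divided by the song's standard deviation, which is the natural reading of the three quantities introduced above. The use of the population standard deviation and the function name are further assumptions.

```python
from statistics import mean, pstdev


def normalized_feature(phrase_values, song_values):
    """(phrase average - song average) / standard deviation over the song."""
    spread = pstdev(song_values)  # population standard deviation (assumed)
    if spread == 0:
        return 0.0  # every note in the song has the same value
    return (mean(phrase_values) - mean(song_values)) / spread
```

The same function would apply to pitch, duration and velocity, since each is a per-note value averaged within the phrase and over the song. A positive value means the phrase lies above the song average, which matches conditions such as pitch > 0 in the decision tree.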

As for the frequency, phrases are classified into the same class if they contain
submelodies belonging to the same group described in section 3. Then, the feature
value of the frequency of a phrase is defined as the size of the class that the phrase
belongs to. The feature vector of a phrase is denoted as $(p, d, v, f)$, where $p$, $d$, $v$
and $f$ stand for the feature values of pitch, duration, velocity and frequency,
respectively.

Fig. 4. An Example of a Decision Tree

A decision tree consists of nodes, edges and leaves that represent a feature, 
a condition and a class, respectively. A path from the root to a leaf represents 
a conjunctive condition. Classification is carried out by finding a path such that 
the feature vector of the object meets the required condition and the object is 
classified to the class assigned to the leaf of the path. Figure 4 shows an example 
of a decision tree. For example, the object of the feature vector (0.3, 0.2, 0.4, 0.3) 
meets the condition of the path to e. Hereafter, we use the leaf and the condition 
corresponding to the path to the leaf interchangeably. 

In this application, the purpose of the decision tree is to know the likelihood
that an object belongs to a class rather than to determine a specific class. We
therefore assign the list of numbers $n_1, n_2, \ldots, n_c$ to leaves instead of classes,
where $n_i$ stands for the number of objects that meet the condition of the leaf
and belong to class $c_i$ in the training data. For example, the leaf $e$ in figure
4 indicates that a total of 47 phrases meet the condition of $e$, i.e., pitch > 0
and frequency > 0.22, and seven of them are themes. Let $\Pr(+ \mid e)$ denote the
conditional probability that an object meeting the condition of $e$ is a theme.
We can then estimate this probability as 7/47 = 0.149 using the number list of
the tree in figure 4. Similarly, we can estimate the class conditional probability
$\Pr(e \mid +)$ that a theme phrase meets the condition of $e$. The classification step
takes a sequence of phrases $p_1 p_2 \cdots p_n$ and produces a sequence of leaves $l_1 l_2 \cdots l_n$
such that the feature vector of $p_i$ meets the condition of $l_i$. For example, suppose
that the feature vectors of a sequence of phrases are

(-0.1, 0.2, 0.1, 0.1) (-0.2, 0.1, -0.1, 0.5) (0.1, -0.1, 0.2, 0.1),

then this phrase sequence is converted to the leaf sequence $acd$.
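
Both kinds of probability can be read off the number lists attached to the leaves, as the following sketch shows. Only the counts of leaf e (47 phrases, seven of them themes) come from the text; the second leaf and its counts are hypothetical and serve only to make the example complete.

```python
def leaf_probabilities(counts):
    """Estimate Pr(class | leaf) and Pr(leaf | class) from leaf number lists.

    `counts` maps each leaf to a dict from class label to the number of
    training objects of that class that meet the leaf's condition.
    """
    class_totals = {}
    for per_class in counts.values():
        for label, n in per_class.items():
            class_totals[label] = class_totals.get(label, 0) + n
    posterior = {}   # (class, leaf) -> Pr(class | leaf)
    likelihood = {}  # (leaf, class) -> Pr(leaf | class)
    for leaf, per_class in counts.items():
        leaf_total = sum(per_class.values())
        for label, n in per_class.items():
            posterior[(label, leaf)] = n / leaf_total if leaf_total else 0.0
            total = class_totals[label]
            likelihood[(leaf, label)] = n / total if total else 0.0
    return posterior, likelihood


counts = {
    "e": {"+": 7, "-": 40},   # from the text: 47 phrases, seven themes
    "x": {"+": 13, "-": 10},  # hypothetical leaf, for illustration only
}
posterior, likelihood = leaf_probabilities(counts)
print(round(posterior[("+", "e")], 3))
```

The class conditional probabilities in `likelihood` are the quantities used by the syntactical analysis step.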

The syntactical analysis step converts a leaf sequence $l_1 l_2 \cdots l_n$ to a class
sequence $c_1 c_2 \cdots c_n$. We apply an error-correcting parsing technique (e.g., [9]) to
the syntactical analysis, using a probabilistic syntactical rule. In this study, we
consider Japanese popular songs whose sequence pattern is simple enough to be
represented by a regular grammar. Therefore we use a probabilistic finite state
automaton to represent the syntactical rule. Figure 5 shows a simplified
probabilistic automaton. The transition probabilities in the probabilistic automaton
can be estimated by the well-known HMM learning algorithm [2] from the
training data.

The error-correcting parser uses Bayes’ decision theory. For a leaf sequence
$L = l_1 l_2 \cdots l_n$, it produces the class sequence $C = c_1 c_2 \cdots c_n$ that maximizes
the probability $\Pr(C \mid L)$. This probability is equal to $\Pr(L \mid C)\Pr(C) / \Pr(L)$ by
Bayes’ theorem. Since $L$ is fixed in this problem, it is sufficient to find $C$ that
maximizes $\Pr(L \mid C)\Pr(C)$. Assume that the class conditional probability is
independent of the position of the phrase. Then, the problem is to find $C$ that
maximizes

$$\Pr(C) \prod_{i=1}^{n} \Pr(l_i \mid c_i). \qquad (2)$$

For a class sequence $C$, the probability $\Pr(C)$ is obtained from the probabilistic
automaton by multiplying the probabilities along the path that accepts $C$, while
the probability $\prod_{i=1}^{n} \Pr(l_i \mid c_i)$ can be estimated from the decision tree. Therefore,
we can derive the class sequence $C$ that maximizes (2) using the decision tree
and the probabilistic automaton.
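
The maximization of (2) can be sketched with a Viterbi-style search. The sketch assumes that every automaton state carries exactly one class label and that the best single path stands in for $\Pr(C)$. The automaton and all probabilities below are hypothetical, since the values of figures 4 and 5 are not reproduced here.

```python
def best_class_sequence(leaves, states, start, trans, leaf_given_class):
    """Find the state path maximizing Pr(path) * prod_i Pr(leaf_i | class_i).

    states: state -> class label; start: state -> initial probability;
    trans: (state, next_state) -> transition probability;
    leaf_given_class: (leaf, class) -> Pr(leaf | class).
    `leaves` must be non-empty. Returns (classes, score), or None if no
    path accepts the sequence.
    """
    best = {}  # state -> (score, path) of the best path ending in that state
    for s, p in start.items():
        score = p * leaf_given_class.get((leaves[0], states[s]), 0.0)
        if score > 0:
            best[s] = (score, [s])
    for leaf in leaves[1:]:
        following = {}
        for (s, t), p in trans.items():
            if s not in best:
                continue
            score = best[s][0] * p * leaf_given_class.get((leaf, states[t]), 0.0)
            if score > following.get(t, (0.0, None))[0]:
                following[t] = (score, best[s][1] + [t])
        best = following
    if not best:
        return None
    score, path = max(best.values(), key=lambda item: item[0])
    return [states[s] for s in path], score


# Hypothetical automaton with two accepting paths, p-q-r and p-s-r.
states = {"p": "-", "q": "-", "s": "+", "r": "+"}
start = {"p": 1.0}
trans = {("p", "q"): 0.9, ("p", "s"): 0.1, ("q", "r"): 1.0, ("s", "r"): 1.0}
leaf_given_class = {("a", "-"): 0.3, ("c", "-"): 0.1, ("c", "+"): 0.4, ("d", "+"): 0.5}
result = best_class_sequence(["a", "c", "d"], states, start, trans, leaf_given_class)
```

With these made-up numbers, leaf c on its own favours the theme class, but the low transition probability into state s makes the path through q win, so only the third phrase is labelled as a theme.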

For example, consider the leaf sequence $acd$. There are two paths $pqr$ and $psr$
that accept this sequence in the automaton in figure 5. These paths correspond
to the class sequences $- - +$ and $- + +$, and the probabilities $\Pr(- - +)$ and
$\Pr(- + +)$ are 0.45 and 0.01, respectively. On the other hand, let us compare
$\Pr(acd \mid - - +)$ and $\Pr(acd \mid - + +)$. It is sufficient to calculate the class conditional
probabilities of the second phrase, i.e., $\Pr(c \mid -)$ and $\Pr(c \mid +)$, because the first
and the third classes are the same. $\Pr(c \mid -)$ is estimated as $\frac{\ldots}{25 + 13 + 10 + 15 + 40}$
and $\Pr(c \mid +)$ as $\frac{\ldots}{\ldots + 26 + 7}$ from the decision tree in figure 4. From these
probability estimates, we derive the inequality
$\Pr(- - +)\Pr(acd \mid - - +) > \Pr(- + +)\Pr(acd \mid - + +)$,
and determine that only the third phrase is a theme.
Note that the second phrase is classified as a theme if we classify it using the
decision tree. However, we know from the syntactical rule in figure 5 that it is
rare that the second phrase is a theme and determine that it is a non-theme
phrase.

Fig. 5. Regular Patterns of Phrase Structure



5 Experimental Results 

We applied the present method to Japanese popular music retrieval. The database 
consists of 94 Japanese popular songs in MIDI format. These songs contain a 
total of 886 phrases. 

We first applied the above phrase extraction method to the songs. We used 
two measures, recall and precision, for the phrase extraction performance defined 
as follows: 

$$\text{recall} = \frac{\text{number of extracted phrases starting with the correct boundary}}{\text{number of real phrases}}$$

$$\text{precision} = \frac{\text{number of extracted phrases starting with the correct boundary}}{\text{number of extracted phrases}}$$
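
These two measures can be computed from the start positions of the extracted and the real phrases, as in the sketch below. Representing a phrase by the position of its first note is an assumption, and the function name is hypothetical.

```python
def boundary_scores(extracted_starts, true_starts):
    """Recall and precision of phrase extraction, judged by start boundaries."""
    extracted = set(extracted_starts)
    truth = set(true_starts)
    correct = len(extracted & truth)  # extracted phrases with a correct start
    recall = correct / len(truth) if truth else 0.0
    precision = correct / len(extracted) if extracted else 0.0
    return recall, precision
```

For example, if four phrases are extracted and three of them start at one of five real phrase boundaries, the recall is 3/5 and the precision is 3/4.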

From 94 songs, we extracted 2961 submelodies and these submelodies were 
grouped into 1318 phrases. The recall was 0.766 and the precision was 0.786 
in this experiment. Two types of errors were observed. One is caused by
errors in submelody extraction: the presented method uses only rests for
submelody extraction, and it fails to find some phrase boundaries. Performance
will be improved by using information about rhythms and harmonic
progressions as well as rests. The other type of error was observed when more than two
phrases always appeared in the song in the same order. In this case, our method 
recognizes the sequence of phrases as one phrase. 

For the classification, we first measured the effectiveness of each feature. In 
this experiment, we repeated the following procedures: 

— randomly allocate the prepared 886 correct phrases to 736 phrases for train- 
ing data and 150 phrases for test data; 

— for each feature, induce the decision tree using the training data and the 
C4.5 classifier [13]; and 

— measure the accuracy of the induced classifier using the test data. 

Table 1. Effectiveness of Features

feature    pitch   velocity   duration   frequency
accuracy   0.828   0.730      0.696      0.870

Table 2. Size of the Training Data Set and Accuracy

size       629     838     943     1048    1132
accuracy   0.833   0.819   0.837   0.833   0.850



Table 1 shows the average accuracy. As shown in the table, pitch and frequency 
are effective features for theme phrase extraction. From this result, we derive 
the rules: 

— theme phrases tend to have high pitch 

— theme phrases are repeated more than non-theme phrases. 

We applied the above syntactical analysis method to the extracted sequence 
of phrases. In this experiment, we repeated the decision tree induction and clas- 
sified the phrases as we did in the experiment on the effectiveness of features, and 
measured the average accuracy. We first observed how the size of the training 
data set affects the accuracy. Table 2 shows the average accuracy of classifica- 
tion only by the decision tree for various sizes of the training data set. The table 
shows that accuracies are almost independent of the size of the training data 
set. This indicates that we do not need to increase the training data for a larger 
song database. 

We also measured the accuracy of our method using the decision tree induced 
from 629 phrases, and observed that the average accuracy was 0.884, that is, we 
could improve the accuracy by about 5% by using syntactical information as 
well as features of phrases. When we applied our method to correct phrases, the 
accuracy was 0.952. The accuracy deteriorated by about 6% due to the phrase 
extraction errors mentioned above. 



6 Conclusions 

In this paper, we discussed a music digital library from the point of view of mu- 
sical information retrieval and proposed a method to extract theme phrases as 
concise representations of songs. This method consists of two steps: phrase ex- 
traction and syntactical classification of segmented fragments of melodies. Phrase 
extraction is carried out based on a few heuristic rules. We conducted experi- 
ments on the accuracy of phrase extraction using 94 Japanese popular songs and 
obtained 0.766 recall and 0.786 precision. On the other hand, the syntactical 
classification is based on a probabilistic syntactical pattern analysis combining 
classification and syntactical analysis. A decision tree and a finite state automa-
ton are used for the pattern analysis. The accuracy of theme phrase extraction 
was 0.884. 

For the purpose of concise representation of retrieved songs, the accuracy of 
phrase extraction is important because users feel incongruity when listening to 
part of a song that they do not perceive as a phrase. Currently, we use only 
melody information, but we plan to improve the accuracy of phrase and theme 
phrase extraction by using beat and chord information for segmentation. We 
observed that this accuracy could be improved by about 6% for the correctly 
segmented phrases in the experiment. This indicates that the improvement of 
phrase extraction is effective for theme extraction. 

In the experiment we used 94 songs. We need to gather more songs to discuss 
the effectiveness of the proposed method precisely. In the experiment, we showed 
that the decision tree could be induced from a small training data set. However, we
need more data to analyze the generality of the proposed method. The experi- 
mental results were obtained using Japanese popular songs. There are, however, 
many genres in music. For other genres we may need to use another decision 
tree and automaton. In future work, we plan to study how general the proposed 
method is and how to switch the classification and syntactical analysis rules. 

Query formulation is an important part of multimedia digital libraries. The 
current system is restricted to use of a MIDI instrument. Humming will be 
one of the simplest ways to specify queries. We are now planning to extend 
the system for handling queries by humming. Robustness is another problem in 
query formulation. It is difficult for users to give complete phrases as a query 
by performing on an instrument or humming. We need to study the errors in 
queries and improve the robustness of matching of queries and songs. 


Abstract. We present an approach to increasing the effectiveness of ranked-output
retrieval systems that relies on graphical display and user manipulation
of “views” of retrieval results, where a view is the subset of retrieved 
documents that contain a specified subset of query terms. This approach has 
been implemented in a system named VIEWER (VIEwing WEb Results), acting 
as an interface to available search engines. An experimental evaluation of the 
performance of VIEWER in contrast to AltaVista is the major focus of the 
paper. We first report the results of an experiment on single, short query 
searches where VIEWER, used as an interactive ranking system, markedly 
outperformed AltaVista. We then concentrate on a more realistic searching 
scenario, involving free query formulation, unconstrained selection of retrieval 
results, and possibility of query reformulation. We report the results of an 
experiment where the use of VIEWER, compared to AltaVista, seemed to shift 
the user effort from inspection to evaluation of results, increasing retrieval 
effectiveness and user satisfaction. In particular, we found that the VIEWER 
users retrieved half as many nonrelevant documents as the AltaVista users while 
retrieving a comparable number of relevant documents. 



1. Introduction 

Information retrieval, as experienced by most Web users, is an iterative and interactive
process which consists of submitting a query, seeing the ranked document summaries
returned in response to the query (which may lead to downloading the associated
full documents), and submitting a new query, until the sought information has been
found or the search has been abandoned. Unfortunately, the unmanageably large
response sets of Web search engines coupled with their low precision and ranked list 
presentation may make summary perusal hard, time-consuming, and costly for the 
user. Research in information retrieval is thus increasingly focusing on the lack of 
effectiveness of current retrieval engines’ interfaces. 

S. Abiteboul, A.-M. Vercoustre (Eds.): ECDL ‘99, LNCS 1696, pp. 106-125, 1999 

The need for concise display and user-oriented manipulation of retrieval results has
been recognized before the advent of Web-based ranked-output search services. 
Among various other systems, Bead [7] and LyberWorld [12] depicted clustering 
patterns in a document space using three-dimensional visualization schemes, 
InfoCrystal [16] used a particular visual representation of a Venn diagram to suggest 
how to refine Boolean queries, TileBars [10] displayed distribution of query terms 
within each document to locate its relevant parts, and Ulysses showed a lattice of 
terms and documents that can be searched in various and integrated ways [6]. Most of 
these systems, however, cannot be applied to Web-based retrieval because they are 
either computationally expensive, or require sophisticated graphical facilities, or do 
not scale well, or rely on underlying retrieval models other than best-match ranking, 
or, more often, present a combination of these features. 

The goal of our research is to facilitate inspection and utilization of Web retrieval 
results. Some of the more stringent requirements of Web searches such as presentation 
of results by summaries, interaction with nonexpert users, speed, and incrementality 
have been specifically addressed only very recently. Two examples are the works by 
Zamir and Etzioni [21], in which they described a cluster-based method for reordering 
and labelling the first documents returned by Web engines, and by Tombros and 
Sanderson [18], where they suggested using a query-biased document description to 
better support the users’ need to refer to full documents. Instead of presenting the user 
with a different list of summaries or with a list of summaries with a different 
document description, we aim at giving the user more control over the set of 
summaries that can be selected for perusal. 

We present a graphical interface to Web search engines that displays characteristics 
of documents which are significant in supporting the decision to peruse or not. This 
acts as an intermediate layer between the query specification stage and the actual 
display of the document summaries; the latter takes place only on user demand after 
interaction with the intermediate layer. Our approach is based on the notion of view, 
where a view is simply defined as the subset of retrieved documents that contain a 
specified subset of query terms. Similar to other recent exact matching retrieval 
systems that will be discussed below, the main rationale of our approach is that the 
selection of documents of interest can be facilitated by decomposing a query into its 
constituents and checking for their inclusion in a document individually. 

A major part of this paper is then a study on the retrieval effectiveness of this kind 
of component for on-line interactive searches. We evaluate the performance of using 
the view mechanism to select document summaries in contrast to their conventional 
ranked presentation directly following a user query, as in Web search engines. We 
experiment over a large test collection with external subjects, considering both single-query
searches and multiple-query searches. Experiments of this kind have been
regrettably rare in the literature on user-oriented visualization and manipulation of
retrieval results, probably due to a combination of technological, organizational, and 
economical difficulties; we feel that this gap should be filled in order to assess the 
utility of these tools in a more realistic manner. 

The rest of the paper is organized as follows. We first discuss the renewed interest
in exact matching retrieval and describe the system prototype in which our approach
has been implemented, named VIEWER (VIEwing WEb Results). Then we present
the experimental study, which consists of two distinct experiments. In the first 
experiment we compare the ranking performance of AltaVista to that of VIEWER, 
used as an interactive ranking system, on single, short query searches. In the second 
experiment we compare the performance of VIEWER to that of AltaVista on a more 
realistic subject searching task, involving free query formulation, free inspection and 
selection of results by users, and possibility of query reformulation. For this task, we 
use evaluation measures related to document relevance, as well as to time and 
subjective opinions of the users. A discussion of some general lessons learned from 
these experiments along with directions for future work conclude the paper. 



2. Exact Matching Retrieval 

Consistent with earlier results on the effectiveness of information retrieval systems,
Web search services have heavily favored best matching over other types of document 
retrieval. However, some specific requirements of Web-based retrieval challenge this 
view. First of all, while we are often primarily interested in precision rather than 
recall, there is evidence that best matching retrieval achieves lower precision ratios 
than exact matching retrieval for large databases, and that this difference increases as 
databases grows [2]. Secondly, while best matching retrieval is designed to take 
advantage of the presence of many query terms to describe a user’s information need, 
the average number of user-supplied query terms in Web searches is usually very 
small, often less than 2. Given this scenario, exact matching retrieval is seen with 
renewed interest, both as an alternative or as a complementary technique to traditional 
best matching retrieval. 

One of the most interesting recent findings that supports the use of exact matching
retrieval was presented by Clarke et al. [8], who showed that a variant of the well
known coordination level-based retrieval method may achieve not only better
precision but also better recall-precision performance than best match ranking when
the user queries are short.^1 This result is particularly impressive considering that, since
coordination level-based retrieval uses a purely syntactical ranking criterion, it fails to
recognize all the situations in which documents containing fewer query terms are
more relevant than documents containing more query terms. This usually happens
when a short (exact) partial match between query and documents is found that closely
corresponds to the query concept while there are other longer (exact) partial matches
that are less “about” the query concept. As a result, coordination level-based retrieval
may easily favour spurious or irrelevant matches over relevant ones, thus lowering
precision with many false drops. In order to improve the effectiveness of (syntactic)
keyword retrieval we must therefore deal with the (semantic) problem of
discriminating between proper and improper partial matches between query and
documents.^2

^1 Using coordination level-based retrieval the documents are ranked according to the number of
distinct query terms that they contain, which is referred to as their coordination level (see for
instance Van Rijsbergen [19]). If the user query contains n terms, the documents that contain
n query terms are ranked before those containing n-1 terms, which, in turn, are ranked before
those containing n-2 terms, and so on. This method is also termed quorum-level search
[15].
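
To make the coordination level criterion concrete, the following Python sketch ranks a toy set of documents by the number of distinct query terms they contain. It is only a minimal illustration and not the variant studied by Clarke et al. [8]; the function names, the toy documents, and the representation of a document as a list of terms are illustrative assumptions.

```python
def coordination_level(query_terms, doc_terms):
    """Number of distinct query terms that occur in the document."""
    return len(set(query_terms) & set(doc_terms))


def rank_by_coordination_level(query_terms, docs):
    """Return document ids sorted by decreasing coordination level.

    `docs` maps a document id to the terms it contains (hypothetical layout).
    Documents with no query term are dropped; ties keep their input order
    because Python's sort is stable.
    """
    scored = [(doc_id, coordination_level(query_terms, terms))
              for doc_id, terms in docs.items()]
    scored = [(doc_id, level) for doc_id, level in scored if level > 0]
    scored.sort(key=lambda pair: -pair[1])
    return [doc_id for doc_id, _ in scored]


docs = {
    "d1": ["weather", "predictions", "accuracy"],
    "d2": ["bible", "scientific", "accuracy", "predictions"],
    "d3": ["bible", "history"],
}
ranking = rank_by_coordination_level(
    ["scientific", "accuracy", "bible", "predictions"], docs)
```

In this toy case the weather document d1 (two matching terms) is ranked ahead of d3, which matches only the term "bible": the criterion counts terms and cannot tell which partial match is closer to the query concept.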

To deal with this problem we may use contextual information on the query terms. 
Bodoff and Kambil [3], for instance, suggested using cataloger-provided dependencies 
between the subject terms of the documents being searched. Another way of obtaining 
contextual information is to have the user express Boolean constraints over the set of 
query terms to indicate which terms cover which aspects of the query, e.g., the 
constraint (A OR B) AND (C OR D) specifies two aspects of the query, each 
represented by two keywords [11]. These approaches may be useful to solve specific 
aspects of the exact partial match problem but the formulation of the additional 
information that they require lends itself to improper partial matches. Furthermore, the 
specification of query filters takes place before searching the database, while the 
relevance of a partial match depends also on the content of the database being 
searched. 
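
The Boolean constraint mentioned above can be read as a list of aspects, each of which must be covered by at least one of its keywords. A minimal Python sketch of this reading follows; the function name and the data layout are illustrative assumptions and do not reproduce the formulation of [11].

```python
def satisfies_aspects(doc_terms, aspects):
    """True if the document contains at least one keyword of every aspect.

    `aspects` is a list of keyword sets, e.g. [{"A", "B"}, {"C", "D"}]
    for the constraint (A OR B) AND (C OR D).
    """
    present = set(doc_terms)
    return all(present & set(aspect) for aspect in aspects)


aspects = [{"A", "B"}, {"C", "D"}]
covers_both = satisfies_aspects(["A", "D", "x"], aspects)   # one keyword per aspect
covers_one = satisfies_aspects(["A", "B", "x"], aspects)    # second aspect missing
```

Note that the filter is fixed before the search: a document such as the second one is rejected regardless of what the rest of the database contains.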

A semi-automatic solution to the partial match problem is to present the user with
information that highlights the distribution of the various possible meanings (arising
from partial matches between the query and the documents) in the documents 
themselves, and then let the user select the documents containing the meanings of 
interest. This approach has been taken by Veerasamy and Heikes [20], with the main 
goal of clarifying the role played by query terms in the result of ranked output 
systems. They visualize the weights of the query terms contained in all documents 
retrieved in response to a query and then let the user choose the documents that 
contain the most relevant combinations of weighted terms. Our approach shares a 
similar concern but employs a radically different visualization and interaction scheme. 
Instead of visualizing the weights of the query terms of each retrieved document we 
concentrate on all the possible subsets of query terms (i.e., subqueries) that can be 
generated from the user query, showing their distribution in the set of retrieved 
documents and allowing the user to select the set of documents associated with each of
them. We speak of views because in this way the user may see parts of the results without
seeing the whole list. Views are defined in a precise way from the retrieved
documents through a simple and comprehensible characteristic of their content, i.e., 
the subset of distinct query terms that they contain. 



3. Description of VIEWER 



VIEWER is built around available “primary” Web search services, presenting users
with a single unified interface. Users enter a query, which VIEWER forwards to a
selected search engine (AltaVista, in the current implementation). VIEWER then
collects the query results and shows, in a scrollable window, a subset of the document
summaries, in the same ranked order as returned by the search engine; in addition, it
shows, in the rest of the screen, a graphical visualization of results. The visualization
consists of an aligned sequence of horizontal bars, one for each of the nonempty
subqueries (at most $2^n - 1$) that can be formed with the n query terms. Subqueries
are displayed in the order of increasing number of terms, with the longest subqueries
at the bottom; the length of each bar is proportional to the number of document
summaries containing that subquery, which is also explicitly displayed next to the bar.
By clicking on a bar the user may select the corresponding view, bringing up the
associated summaries into the document window. The number of retrieved document
summaries considered by VIEWER is currently set at 40; the maximum number of
query terms used for visualization is set at 4, because the display of the subquery
distribution in the retrieved documents would become cumbersome for longer queries.
In practice, however, the latter is not a serious limitation due to the paucity of query
terms in real searches.

^2 Bodoff and Kambil [3] identified several types of out-of-context matches (i.e., when some
query terms match out of context of their relationships to the other terms), including
polysemy, out of phrase terms, secondary topic keyword, and non-categorical terms.
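
The construction of the views can be sketched in a few lines of Python. This is a hedged illustration rather than the VIEWER implementation: it assumes that a view holds every summary containing all the terms of its subquery (possibly together with other query terms), that summaries are already tokenized into sets of terms, and that the names `subqueries`, `build_views`, and the toy data are hypothetical.

```python
from itertools import combinations


def subqueries(query_terms):
    """Yield every nonempty subset of the query terms, shortest first."""
    terms = list(dict.fromkeys(query_terms))  # drop repeated terms, keep order
    for size in range(1, len(terms) + 1):
        yield from combinations(terms, size)


def build_views(query_terms, summaries):
    """Map each subquery to the ids of the summaries that contain all its terms.

    `summaries` maps a summary id to the set of terms found in that summary.
    """
    return {
        sub: [sid for sid, terms in summaries.items() if set(sub) <= set(terms)]
        for sub in subqueries(query_terms)
    }


summaries = {
    "s1": {"weather", "predictions", "accuracy"},
    "s2": {"bible", "predictions", "accuracy"},
    "s3": {"bible", "history"},
}
views = build_views(["accuracy", "bible", "predictions"], summaries)

# One text bar per subquery; the bar length is the size of the view.
for sub, ids in views.items():
    print(f"{' '.join(sub):32} {'#' * len(ids)} {len(ids)}")
```

With n = 3 query terms the sketch produces the $2^3 - 1 = 7$ possible views, listed in order of increasing subquery length as in the display described above.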

As an example session with VIEWER, consider searching the following subject 
over the Web: “scientific accuracy of Bible predictions”. Figure 1 shows the response 
of VIEWER to the user query: scientific accuracy Bible predictions, as of 25 March, 
1999. The document summaries returned by AltaVista are shown in the right window.
The graphical display quickly shows that the results produced by AltaVista were, in
general, unsatisfactory, because most retrieved documents did not deal with the Bible.
What happened was that some subqueries (e.g., “predictions accuracy”, “scientific 
accuracy”, or just “predictions”) matched out of the context of the primary topic 
subquery (i.e., Bible), and retrieved documents about such diverse domains as 
currencies, weather, physics, and astrology. What is more important, the non-relevant 
retrieved documents were ranked by AltaVista well ahead of the relevant ones. In fact, 
the documents about “scientific accuracy of the Bible” were ranked by AltaVista from 
30 to 40, and thus a user would have probably completely missed them in a normal 
search. Using VIEWER, instead, the user may immediately select those few 
summaries that appear to be relevant, without perusing the others. In addition, the 
results displayed by VIEWER suggest that the user might profitably reformulate the 
query by emphasizing the primary subject of the search (e.g., adding such terms as 
biblical, Christian, religious) and by not using the subqueries that matched out of 
context (e.g., replacing predictions by prophecies). 

The scope of VIEWER encompasses a number of situations where the retrieval
results can be usefully related to a query’s constituents. The questions that can be
quickly answered with VIEWER include: how many documents contain a certain
subquery s, which documents contain s, which terms can be added to (deleted from)
s in such a way that the resulting set of documents is more (still) manageable when s
is contained in too many (few) documents? We might also be interested in the
relationship between different subqueries. So we might ask: what is the contribution of
subquery q compared to subquery s, does subquery q occur more frequently alone or
in conjunction with s, and so on. In addition to enriching inspection of retrieval
results with facilities for selection, comparison and refinement involving groups of
query terms, VIEWER also has the potential to facilitate query reformulation, as seen
in the example. In particular, VIEWER may suggest that in order to focus on the
intended meaning of the query, the user should reformulate the query by adding
(deleting) terms to (from) it or by replacing some terms with narrower/broader terms.
VIEWER may also help detect failure of intended senses of words, i.e., when two
terms used in the query to identify one particular meaning do not occur together in the
retrieval results, or, symmetrically, discover unwanted senses of words [9]. VIEWER
has been implemented as a client-server system; its user interface is a Java applet
which can be downloaded into a Web browser from: http://www.fub.it/viewer/. Thus,
VIEWER copes with most computational constraints of Web-based retrieval (e.g.,
efficiency, portability, adaptability) that are not usually addressed in other document
visualization systems. A more detailed description of VIEWER including architectural
aspects and more Web session examples can be found in [1].

Fig. 1. Visualization of Web results for the topic: “scientific accuracy of Bible predictions”.

4. Experiment 1: Comparing VIEWER and AltaVista on Single-Query Searches

4.1 Goal 

The goal of the experiment was to evaluate the effectiveness of VIEWER in 
reordering the set of documents retrieved by a ranked output retrieval system in 
response to a query. We compared the performance of VIEWER, used as an 
interactive ranking system, with that of AltaVista, which produced the ranked 
document list given as input to VIEWER itself. 



4.2 Subjects 

We tested ten subjects in this experiment. The subjects were recruited in our institute; 
they had a computer science background and good knowledge of English. Each 
subject was provided with a short tutorial session on a training database to ensure that 
he or she could easily manipulate the interface used in the experiment. 



4.3 Database and Queries 

We did not perform subject searching over the Web because it would be difficult to 
assess a system’s retrieval effectiveness and do comparative studies in this 
unrestricted domain without biasing the results. Rather, we used a standard large IR 
corpus, containing a set of predefined topics and their associated relevance 
judgements. We experimented over the TREC4 test topics (201-250) and test 
collection, consisting of disks 2 and 3 (approximately 2 Gigabytes of text) of the 
TREC/Tipster collection. We chose the TREC4 topics because they are short (a one 
sentence topic description) and hence may better reflect an interactive situation. Each 
topic was manually transformed into a three to four term query by selecting terms 
from the topic. The set of queries generated this way had an average length of 3.9 
terms; their complete description is available from:
http://www.fub.it/viewer/queries.txt.



4.4 Implementation of Ranking Systems 

AltaVista can be used for global Web search as well as for indexing and searching site 
specific information. We connected an AltaVista server to the test collection and 
executed the queries, determined as explained in Section 4.3, against the 
corresponding AltaVista database. The result was the ranked list used as a baseline in 
the comparison with VIEWER. 

The implementation of VIEWER as a ranking system was not straightforward,
since VIEWER cannot produce a document ranking by itself. We designed an
interactive procedure that worked as follows. Each subject was shown the topic, the
query extracted from it, and VIEWER’s visualization of the distribution of query
terms in the full-text documents retrieved by AltaVista in response to the query. Then
the subject was asked to choose a sequence of views (without seeing the document
summaries) by repeatedly selecting one of the views offered by VIEWER until all
views had been selected. The documents were thus ranked according to the order
in which the user selected the views.^3 Documents contained in more than one view were
ranked based on the earliest view in which they occurred. In this way we obtained a
partly-ordered retrieval output; we further ranked the (equally-ranked) documents
within each view by using the ranking produced by AltaVista for those documents. As
a result of this process, the final ranking built by the user corresponds to a particular
sorting of the documents contained in the output returned by AltaVista. Each subject
took about one and a half hours to execute the 50 queries.
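
The ranking procedure just described can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code used in the experiment: the function name, the argument layout, and the toy data are hypothetical.

```python
def rank_from_views(view_order, views, engine_ranking):
    """Order documents by the earliest selected view that contains them.

    view_order     -- subqueries in the order in which the subject selected them
    views          -- subquery -> document ids contained in that view
    engine_ranking -- document ids in the order returned by the search engine
    """
    engine_pos = {doc: i for i, doc in enumerate(engine_ranking)}
    ranked, seen = [], set()
    for sub in view_order:
        # Documents already placed by an earlier view keep their earlier rank.
        fresh = [doc for doc in views[sub] if doc not in seen]
        # Within a view, fall back on the search engine's own ranking.
        fresh.sort(key=lambda doc: engine_pos[doc])
        ranked.extend(fresh)
        seen.update(fresh)
    return ranked


engine_ranking = ["d1", "d2", "d3", "d4"]
views = {
    ("a", "b"): ["d4", "d2"],
    ("a",): ["d1", "d2", "d4"],
    ("b",): ["d3", "d2", "d4"],
}
final = rank_from_views([("a", "b"), ("b",), ("a",)], views, engine_ranking)
```

The result is a permutation of the documents returned by the search engine, which is why the two rankings can be compared on the same set of documents.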



4.5 Results 

The results are displayed in Figure 2. The precision-recall curve was normalized
considering, for each query, only the relevant documents that contained at least one
query term; i.e., those that were actually retrieved and ranked by the two methods.

The figure reports interpolated precision at eleven recall levels, averaged over the
50 queries; the results of VIEWER were averaged over the ten subjects. The
performance improvement obtained with VIEWER was consistent at all recall points, and
the differences were statistically significant (with a combined p value of 5.35E-05).
These results confirm and extend earlier findings obtained on two small test 
collections [1], and offer strong evidence that VIEWER can be effectively used by a 
user to reorder the documents returned by Web search engines, at least for short 
queries. 
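
For readers unfamiliar with the measure, the sketch below computes interpolated precision at the eleven standard recall levels (0.0, 0.1, ..., 1.0) for a single ranking, using the usual definition in which the interpolated precision at a recall level is the highest precision observed at that or any higher recall. It is a minimal Python illustration; the function name and the toy data are assumptions, and the normalization described above is obtained simply by passing as `relevant` only the relevant documents that appear in the ranking.

```python
def eleven_point_precision(ranking, relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one ranking."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("at least one relevant document is required")
    # (relevant documents seen so far, precision) at each relevant document
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits, hits / rank))
    total = len(relevant)
    curve = []
    for k in range(11):  # recall level k / 10
        # hits / total >= k / 10, compared in integers to avoid rounding issues
        reached = [p for h, p in points if 10 * h >= k * total]
        curve.append(max(reached, default=0.0))
    return curve


curve = eleven_point_precision(["d1", "d2", "d3", "d4"], {"d1", "d4"})
```

Averaging such curves level by level over the queries (and, for VIEWER, over the subjects) gives one curve per system.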

This is a useful starting point to evaluate the utility of VIEWER, but its scope is 
limited by the fact that what we measured is a theoretical, rather abstract, aspect of 
performance. In practice, the operational conditions are very different, because users 
do not examine the whole set of documents retrieved in response to a query and 
because they usually question the system with several queries. This issue is addressed 
in the next section. 



^3 It should be noted that the order of views chosen by the user was very different from the one
that would have been produced by an automatic system based on coordination level.

Fig. 2. Comparison of AltaVista and VIEWER on short queries



5. Experiment 2: Comparing VIEWER and AltaVista on Multiple-Query Searches

5.1 Goal 

The goal of the experiment was to evaluate the effectiveness of VIEWER in contrast 
with AltaVista in a realistic search situation, involving free inspection and selection of 
results by users and the possibility of query reformulation. In particular, we wanted to test
the hypothesis that VIEWER allows the user to focus on the relevant document
summaries obtained in response to a query, without selecting the irrelevant ones, and
that VIEWER helps the user reformulate a query.

5.2 Subjects 

We tested twenty subjects in the experiment. The subjects were undergraduate 
students at the University of Rome with the following main characteristics 
(ascertained through a pre-test questionnaire): basic computer experience, some 
familiarity with on-line document searches, and a good knowledge of English. Each
subject was paid sixty dollars for participating.

5.3 Databases and Topics

We used the same test collection as in the first experiment (TREC4) because its short 
topics favour the use of short queries and because we can consider our results to scale, 
given the size and the characteristics of the collection. In addition, using the same 
collection allows cross-experiment comparison. For this experiment we used only six 
topics, randomly selected from the 50 TREC4 test topics: 

• Topic 202: Status of nuclear proliferation treaties: violations and monitoring. 

• Topic 204: Where are the nuclear plants in the U.S. and what has been their rate of 
production? 

• Topic 215: Why is the infant mortality rate in the United States higher than it is in
most other industrialized nations? 

• Topic 221: Steps taken by church, governments, community, civic organizations to 
halt carnage among youths engaged in drug or gang warfare. 

• Topic 222: Is there data available to suggest that capital punishment is a deterrent 
to crime? 

• Topic 240: What controls, agreement, technological advances or equipment are 
now in use or planned to assist in combating terrorism? 



5.4 Implementation of Retrieval Systems 

We minimized as much as possible the effect that having different interfaces has on 
performance. The interfaces to the two retrieval systems were implemented as Java 
applets; they ran on the same machines (PC’s) and used many identical interaction 
devices such as the topic window, the query formulation facility, and the document 
summary display-and-evaluation facility. The interface used to test VIEWER was like 
Figure 1; the only difference was that it also displayed the topic being searched. The 
interface used to test AltaVista was like that of VIEWER, except that the lower left
region of the screen, containing VIEWER’s display stage, was empty.



5.5 Experiment Design 

In our experimental design the independent variable is the mechanism for selecting the 
document summaries, which may be based either on VIEWER or on AltaVista. The 
test comprises two tasks that a group of subjects will have to perform: to retrieve 
summaries that are relevant to the given set of topics using either VIEWER or 
AltaVista. The dependent variable is the performance of a group of subjects in these 
tasks, whose variation may be attributed to the change in the level of the independent 
variable provided that other biasing factors are kept under control [13]. Our 
experimental setting attempted to ensure this condition. 

To assign tasks to subjects we used an independent subject design [14]. The 20
subjects were randomly split into two groups of 10 subjects, and each group was
assigned to one experimental condition only. The instructions for the task were given in
a manner similar to the interactive track specification at TREC: “find as many good
document summaries as you can for a topic, in around 20 minutes, without collecting
too much rubbish”. The subjects were asked to label each summary they read as
relevant or nonrelevant; they did not have access to full-text documents, because this would have
distracted them from the main focus of the experiment, i.e., the selection of relevant
summaries.

During each search, the subjects using AltaVista could formulate more than one
query, seeing the summaries obtained in response to each query in a scrollable
window. The subjects using VIEWER could also formulate more than one query, but in order
to see the summaries they had to select some view first. To be more precise, after
submitting a query the subjects were not presented with a list of summaries, as with
AltaVista, but with VIEWER’s visualization stage relative to (at most) the first 150
summaries retrieved by AltaVista itself. At this point, they could select one or more
views and read the summaries associated with each of them.



5.6 Experiment Operationalization 

The experimental sessions took place in the “Human factors” laboratory at Fondazione 
Ugo Bordoni and lasted four days. We tested five subjects at a time, supervised by the 
same experimenter; each subject performed the task alone, in an acoustically-isolated 
room, with a video camera recording the session. The subjects were provided with a 
tutorial session of about an hour, including a search on a training topic. Great attention 
was paid to ensure that at the end of the training they could easily manipulate the 
interface for specifying queries and seeing summaries, including the view mechanism. 
Then they did the six searches, with a five-minute break between one search and
another. Twenty minutes were allocated for each search, although the subjects were
allowed to give up at any time after 15 minutes. After each search, they completed a 
search evaluation questionnaire. At the end of the experiment, a more elaborate 
questionnaire on their use of the system was administered. The experimental sessions 
were fully digitized: the topic to be searched automatically appeared on the screen and 
all questionnaires were filled out by computer. 



5.7 Performance Measures 

Since there are no established performance metrics that measure the effectiveness of 
interactive information retrieval systems, especially when visualization of retrieval 
results is involved, it is advisable to use different evaluation scenarios and measures. 
In particular, it seems useful to try to extend conventional measures of batch retrieval 
to the interactive context by also taking into account the dynamics of retrieval sessions
[6], [5] and the user’s opinions [4]. We focus on precision (i.e., the ratio of number of 
items retrieved and relevant to the number of items retrieved), because this is usually 
the primary concern for users engaged in on-line interactive searches. One definitional 
problem with batch measures, including precision, is that they are based on the notion 
of retrieved document, which is often difficult to define in an interactive setting. One 
approach [20] is to consider as retrieved documents only the documents that have been
seen and judged to be relevant by the user, but this approach has the disadvantage that
we might be measuring the user’s judgement more than the effect of the retrieval
method. A more typical choice (e.g., [17], [6]) is to rate a document as a retrieved 
document as soon as its description (the summary, in our case) is recovered, without 
considering the user’s judgement. In this experiment we have taken the latter 
approach; i.e., all summaries labeled by a subject in an experimental session, whether 
subjectively labeled relevant or nonrelevant, were considered as retrieved by the 
system. Then, in order to consider the dynamics of the session, we measure how the 
precision of the interactive retrieval varies as a function of retrieved documents and 
time. 
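
As an illustration of the measure adopted here, the following Python sketch computes precision after each retrieved summary, counting every summary labelled by the subject as retrieved and using an external set of relevance judgements to decide which ones are relevant. The function name and the toy session are hypothetical; the sketch only shows the definition, not the authors' evaluation code.

```python
def precision_curve(labelled_in_order, relevant):
    """Precision after each retrieved summary, in the order of labelling.

    labelled_in_order -- ids of the summaries labelled by the subject, in
                         chronological order (all count as retrieved)
    relevant          -- ids judged relevant by the external assessors
    """
    relevant = set(relevant)
    curve, hits = [], 0
    for n, sid in enumerate(labelled_in_order, start=1):
        if sid in relevant:
            hits += 1
        curve.append(hits / n)  # relevant retrieved so far / retrieved so far
    return curve


session = ["s1", "s2", "s3", "s4"]
curve = precision_curve(session, {"s2", "s3"})
```

Pairing each point with the time at which the summary was labelled, instead of its position in the sequence, gives precision as a function of time.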

These objective measures are complemented with the opinions of the users 
gathered from the questionnaires. In the questionnaires we focused on three main 
variables: user satisfaction, utility of the system, and, for the subjects using VIEWER, 
the usage of views. For each variable we considered a number of aspects, e.g., for user 
satisfaction we measured interestingness, effort, and fun. For each aspect, the subjects 
were presented with a five-point rating scale, using both Likert scales and semantic 
differentials [13]. For instance, user satisfaction’s aspects were measured through a 
semantic differential with three pairs of bi-polar adjectives (boring-interesting, 
difficult-easy, and unpleasant-pleasant). 



5.8 Experimental Results 

We present the results according to the type of variable measured and not to how it 
has been measured (i.e., objective versus subjective assessment). The section is split 
into two parts: results about the relative performance of the two systems, and results 
about the usage of views with VIEWER. 

5.8.1 Performance Comparison 

Table 1 shows the average number of summaries per topic retrieved by the two 
systems (partitioning the summaries in relevant and nonrelevant) along with the 
average precision. Using AltaVista, the users retrieved many more summaries than 
with VIEWER, but the number of retrieved relevant summaries was very similar for 
the two systems, which means that the AltaVista users retrieved many more 
nonrelevant summaries. In fact, the precision of VIEWER was markedly better than
that of AltaVista.



Table 1. Performance comparison at the end of search

            Retrieved Rel   Retrieved nonRel   Retrieved Total   Precision
AltaVista   7.1             53.9               61                0.116
VIEWER      6.4             24.75              31.2              0.206

The value of precision shown in Table 1 refers to a complete search, and thus
ignores the dynamics of interaction.




Fig. 3. Interactive precision as a function of retrieved documents 

Figure 3 shows the precision of the two systems as a function of the number of 
retrieved summaries. The results were averaged over the topics and the subjects. The x 
axis in Figure 3 is restricted to 26 because this was the minimum number of 
summaries per topic retrieved by the subjects. The two curves show a similar 
behavior: the precision initially increases until it reaches a peak after which it declines 
rather gracefully. This suggests that in an interactive retrieval setting the most relevant 
documents may not be the very first retrieved documents, as in automatic ranking 
systems, but those retrieved right after the first ones, probably as soon as the subjects 
tune in to the search domain and the search facilities. The results of Figure 3 show that 
the subjects using VIEWER obtained markedly better precision values than those 
using AltaVista. 

Figure 3 tells us with which accuracy relevant summaries have been retrieved, but 
it does not say when they have been retrieved. This is an important piece of 
information because we might be more interested in a system that has a lower overall 
precision but is faster in retrieving a few relevant documents. Figure 4 shows how the 
precision varied as the search time increased. We restricted the search time to 15 
minutes because this was the minimum search time actually taken by the subjects.

Fig. 4. Interactive precision as a function of time.

Fig. 5. Number of retrieved documents (both rel. and nonrel.) as a function of time.

Similar to Figure 3, Figure 4 shows that the precision of both systems quickly rises
and then gracefully decreases as the session progresses. Figure 4 also shows that the
precision of VIEWER was better than that of AltaVista at almost all time points. As in Figure
3, the differences appear to be relevant, although we must be cautious in generalizing
these results to different sets of topics due to the limited number of topics used in the
experiment. Taken together, the results of Figure 3 and Figure 4 suggest that the
retrieval of summaries occurred rather uniformly throughout the session and across
systems. That this was actually the case is shown in Figure 5.

We should emphasize that while in the above results about precision we used the
“objective” relevance judgements of the TREC assessors, we also computed the same
curves using instead the subjective relevance judgements expressed by the subjects
during their search. The results, not shown due to space limitations, were very similar
to those found with objective judgements. This represents additional important
evidence in favour of our approach because in interactive information retrieval it is
likely that user judgements, although less “objective”, better reflect the context in
which the search took place and its dynamic nature.

As mentioned in Section 5.7, we also measured the user satisfaction and the utility
of the system as perceived by the subjects involved in the experiment. For each of 
these two variables, we converted the rating scales into numerical values and 
computed the mean over the subjects and the aspects measured. The results are 
depicted in Figure 6, where the scale ranges from 1 (least helpful) to 5 (most helpful). 
As shown in Figure 6, the subjective opinions of the users provide further evidence 
that VIEWER performed better than AltaVista, both with respect to utility and user 
satisfaction. For the latter variable, according to the user ratings VIEWER was slightly 
more difficult but much more interesting and pleasant than AltaVista. 

Before concluding this section it is useful to look at how an automatic single-query 
retrieval system would fare on the same topics we used in the interactive multiple- 
query searches. It turns out that the retrieval effectiveness of an automatically- 
generated ranked document list would be much worse than that obtained with 
interactive retrieval. These are indeed difficult topics for an automatic system, because 
most relevant documents do not match the topic keywords. For instance, for topics 204,
221, and 240, AltaVista would retrieve no relevant document at all in the first 50 
documents returned in response to a complete topic statement. 







Fig. 6. User ratings for AltaVista and VIEWER. 



5.8.2 Views Usage 

On the whole, the VIEWER users submitted 500 queries and selected 693 views
(about 1.4 views per query, or 11.5 per session). Table 2 shows for how many queries
0, 1, 2, or more than 2 views were selected. According to these results, most of the
time the users selected a very limited number of views, thus reading only a small
subset of the set of summaries returned in response to a query. In the limit, in 29% of
the cases, the subjects selected no views at all, which implies that they decided to
formulate a new query by just looking at the view display, without reading any
summary.

Table 2. Distribution of the number of views.

Number of views   Number of queries   Percentage
0                 146                 29%
1                 222                 44%
2                 62                  12%
>2                70                  15%
Total             500                 100%
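
The per-query and per-session averages quoted with Table 2 can be re-derived from the raw counts. The short Python check below is only a sketch: the number of sessions (60) is not stated explicitly in the text and is inferred here from the design, i.e., 10 subjects using VIEWER with six searches each.

```python
queries = 500          # queries submitted by the VIEWER users
views_selected = 693   # views selected overall
sessions = 10 * 6      # assumption: 10 VIEWER subjects, six searches each

views_per_query = views_selected / queries      # 693 / 500 = 1.386
views_per_session = views_selected / sessions   # 693 / 60 = 11.55
queries_per_session = queries / sessions        # 500 / 60, about 8.3

# Number of queries for which 0, 1, 2, or more than 2 views were selected.
queries_by_views = {"0": 146, "1": 222, "2": 62, ">2": 70}
share_no_view = queries_by_views["0"] / queries  # 0.292, i.e. about 29%
```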



The users preferred views with more terms, which were selected more often and
were usually the users’ first choice. Table 3 gives the number of times a view of a
given length was selected as a first choice, as a second choice, and in total: users
selected the three- or four-term views 45% + 20% = 65% of the time; the first selected
view was usually one with 3 or 4 terms. We also found that the average length of the
selected views usually decreased as more views were selected by the users. Table 4
shows that the first view selected was usually (72% of the time) the longest possible
(i.e., the one with 4 terms, or the one with all the terms in the query if the query
contained fewer than 4 terms). However, it is also important to note that 100% - 72% = 28%
of the time (see Table 4 again), the first view selected was not the longest
possible: some other reasons (e.g., semantics of terms, size of associated set of
documents) induced the users to select another view, usually one among those with the
maximum length minus one.



Table 3. Distribution of the length of views.

View length   1st selected   2nd selected   Total
1             8 (2%)         5 (4%)         31 (5%)
2             87 (24%)       46 (35%)       208 (30%)
3             140 (40%)      73 (55%)       314 (45%)
4             119 (34%)      8 (6%)         140 (20%)
Total         354            132            693





Table 4. Distribution of the relative length of views (wrt query length). 

Relative view length    Number of selections    Percentage 
Max                     256                     72% 
Max - 1                 81                      23% 
Max - 2                 16                      4.5% 



The subjects were asked, through the questionnaires, to rate the importance of the 
criteria used to choose one view rather than another. The results, shown in Figure 7, 
suggest that the users made use of all the major displayed clues about the relevance of 
retrieved summaries. This indication was also confirmed by other data gathered from 
the questionnaires, in which the subjects rated their understanding of the 
view mechanism positively. 

Fig. 7. User ratings of criteria used for view selection. 

One of the objectives of the experiment was to test the hypothesis that VIEWER 
supports query reformulation. The results confirm the validity of our hypothesis, as 
demonstrated by both the number of queries and the query length (number of terms in 
each query). The average number of queries per session was higher for VIEWER (8.3 
versus 6.7), and Figure 8 shows that the average number of queries per minute was 
higher for VIEWER during almost the whole session. 




Fig. 8. Average number of queries during a search session. 

The average query length was also higher for VIEWER (3.8 versus 3.1). Figure 9 
shows that average query length slightly increased as more queries were submitted by 
the users, for both AltaVista and VIEWER. VIEWER stimulated longer queries from 
the beginning of the session (the length of the first submitted query was 3.3 for 
VIEWER and 2.85 for AltaVista), and this difference remained approximately 
constant for the rest of the session. As long queries are customarily difficult to 
formulate for real users, this can be taken as an indication that VIEWER supported the 
subjects in doing so. 

The subjects’ opinions were consistent with these results. The users declared that 
they sometimes used views for adding, removing, or modifying terms in the query. 
The subjects also declared that views were easy to learn (average score 5 on a 1-5 
scale), that they were useful for query reformulation (4.3), that they improved search 
effectiveness (4.7), that they saved time (4.6), and that they did not hinder the search 
(4.9). 

Taken together, the results of Figure 5 and Figure 8 suggest that when passing from 
AltaVista to VIEWER much of the user effort shifted from inspection to evaluation of 
results. Using AltaVista, the users formulated fewer queries and retrieved more 
results, which implies a prolonged direct inspection of the results obtained in response 
to a query. With VIEWER it is just the opposite: the users retrieved fewer results and 
formulated more queries. This suggests that they were engaged in an accurate view- 
based evaluation of the results, which decreased the amount of results actually 
inspected in response to a query and spurred the formulation of new queries. This 
observation is also confirmed by the results of Table 2, which show, as already 
remarked, that the subjects sometimes formulated a new query without retrieving any 
result of the current query. 




Fig. 9. Distribution of query length in the search session (x-axis: query number). 



6 Conclusions 

We took the view that users engaged in information searches may be willing to use 
more clues about the relevance of retrieved documents. We showed the feasibility of 
using the view mechanism to give users more control over the display and manipulation of 
retrieval results. In particular, from our experimental evaluation, two main 
conclusions can be drawn. 

• The view mechanism allowed users to select relevant results with a higher 
precision, reducing the burden of collecting nonrelevant results. 

• The view mechanism shifted the user effort from inspection to evaluation of 
retrieval results, increasing the number and the length of the submitted queries and 
increasing the user satisfaction. 

One important design parameter of VIEWER is the amount of textual information 
extracted from the retrieval results and used to compute the views. If a centralized 
repository with all accessible documents is available, one can use a relatively high 
number of retrieved documents along with their full-text descriptions without 
sacrificing the response time of the interface. This was the case in our experiments. 
However, for ubiquitous searches on the Web, it may be necessary for efficiency 
reasons to use only a very limited number of the documents retrieved by the engines 
and to use the short document summaries, as provided by the engines, without 
downloading the full-text descriptions. The current version of VIEWER manages to 
keep the computational overhead small by using only a few tens of retrieved 
summaries. Of course, using only a small fraction of the amount of textual 
information theoretically available may affect the retrieval effectiveness of the view 
mechanism. While there are some recent results that suggest that this may not 
necessarily be the case [21], an exploration of the main trade-offs involved here (e.g., 
efficiency versus effectiveness, centralized versus distributed) is an issue for future 
research. 



Abstract. Techniques for automatic query expansion from top retrieved 
documents have recently shown promise for improving retrieval effectiveness 
on large collections but there is still a lack of systematic evaluation and 
comparative studies. In this paper we focus on term-scoring methods based on 
the differences between the distribution of terms in (pseudo-)relevant 
documents and the distribution of terms in all documents, seen as a complement 
or an alternative to more conventional techniques. We show that when such 
distributional methods are used to select expansion terms within Rocchio’s 
classical reweighting scheme, the overall performance is not likely to improve. 
However, we also show that when the same distributional methods are used to 
both select and weight expansion terms the retrieval effectiveness may 
considerably improve. We then argue, based on their variation in performance 
on individual queries, that the set of ranked terms suggested by individual 
distributional methods can be combined to further improve mean performance, 
by analogy with ensembling classifiers, and present experimental evidence 
supporting this view. Taken together, our experiments show that with automatic 
query expansion it is possible to achieve performance gains as high as 21.34% 
over the non-expanded query (for non-interpolated average precision). We also 
discuss the effect that the main parameters involved in automatic query 
expansion, such as query difficulty, number of selected documents, and number 
of selected terms, have on retrieval effectiveness. 



1 Introduction 

Experience with operational search systems reveals a significant mismatch between 
their theoretical assumptions and the actual user behavior. While these systems are 
designed to take advantage of the presence of many query terms to describe a user’s 
information need, the average number of user-supplied query terms is usually very 
small, often less than 2. The paucity of query terms exacerbates well known inherent 
limitations of information retrieval systems, such as the difficulty of recovering from 
word mismatch between queries and documents, and it may represent a fundamental 
practical limitation for effective retrieval from large databases. Much of the current 
research in information retrieval attempts to solve this problem by focusing on 
methods for the creation of a query “context” by using such diverse knowledge 
sources as user’s relevance feedback [13], thesauri [8], and conceptual clustering of 
documents and terms [6], rather than concentrating on better ways of matching 
queries against documents. 

One well known, automatic approach to adding contextual information to user 
queries is based on the extraction of useful terms from the top retrieved documents, 
which is also referred to as retrieval feedback or pseudo-relevance feedback. While 
this technique did not, historically, work well, due to losses in precision being higher 
than gains in recall, it has recently received renewed attention for its successful 
application to large scale collections (e.g., [4], [25], [12], [17]). In the TREC 
environment, for instance, almost all groups have recently been using variations 
on expanding queries using information from the top retrieved documents, but the 
benefits of different query expansion techniques have usually been evaluated 
against the non-expanded query rather than by cross-system comparisons. The 
growing interest in pseudo-relevance feedback calls for a more careful and systematic 
evaluation of competing approaches and for a better understanding of their relative 
strengths and weaknesses. 

In this paper we focus on term-scoring functions that are based on the differences 
between the distribution of terms in (pseudo-)relevant documents and the distribution 
of terms in all documents. We consider several instances of this general 
“distributional” approach, including Robertson Selection Value [18] as well as 
statistical and information-theoretic functions. We study how to use these 
distributional functions to improve effectiveness of automatic query expansion. We 
first analyze whether distributional functions can effectively complement more 
conventional reweighting methods such as Rocchio’s formula by selecting the terms 
to be used for query expansion. The results are negative. We then use the same 
distributional methods to select and weight expansion terms, this time showing 
considerable performance improvement over Rocchio’s formula, with or without term 
selection based on distribution analysis. 

The results of the latter experiment encourage a deeper query-by-query analysis. 
We learn that while the distributional methods may achieve comparable mean 
performance, they may also present large variations on individual queries both on the 
ranked set of suggested terms and on the retrieval performance. This observation 
suggests using combination strategies, by analogy with ensembling classifiers in the 
machine learning field. We present a simple approach to combining the results of 
multiple distributional methods and show that the combined method may perform 
better than the individual methods, thus further increasing the performance 
improvement of expanded query over non-expanded query (up to 21.34% for non- 
interpolated average precision). We finally study how the retrieval performance varies 
as a function of the main parameters involved in automatic query expansion, 
including query difficulty, number of selected documents, and number of selected 
terms, showing interesting relationships. 

The rest of the paper is organized as follows. Section 2 characterizes the main 
phases of the automatic query expansion process and discusses the rationale of using 
term-ranking methods based on distribution analysis. Section 3 precisely introduces 
the distributional methods tested in the experiments and evaluates their use to select 
expansion terms within Rocchio’s classical reweighting scheme in contrast with basic 
Rocchio. Section 4 evaluates the performance of the distributional methods when they 
are used to both select and weight expansion terms. Section 5 analyzes the 
performance variations of distributional methods on individual queries. Section 6 
describes a method to combine the results of multiple distributional query expansion 
methods and evaluates its performance. Section 7 discusses the role played by the main 
parameters involved in automatic query expansion in determining the overall 
effectiveness, and Section 8 provides some conclusions and directions for future 
work. 



2 Approaches to Automatic Query Expansion 

To better represent the user information need we can extract useful terms from the 
results of an initial retrieval run. The idea, not new ([1], [9]), is to consider the top 
few documents retrieved as being relevant, in the absence of any real relevance 
judgements. Working from this assumption, the process which leads to a query with 
modified weights and terms typically goes through three main phases: expansion term 
location, expansion term ranking, and weighting of expanded query. 



2.1 Expansion Term Location 

The typical source of evidence for expanding a given query is constituted by all the 
terms in the first r documents retrieved in response to the query from the collection at 
hand, although more sophisticated schemes for locating the candidate expansion terms 
have been proposed, such as using passages ([25], [14]), or using the result of past 
similar queries [12], or running the initial pass on a much larger collection than the 
target collection [23]. 
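
The basic location step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the retrieved documents are assumed to be already tokenized (after stopping and stemming), and the function name `locate_candidate_terms` and its parameters are hypothetical, not taken from the system used in this paper.

```python
from collections import Counter


def locate_candidate_terms(ranked_docs, r=5):
    """Pool the terms of the first r retrieved documents.

    ranked_docs: list of token lists, best-ranked document first (assumed input format).
    Returns a Counter mapping each candidate term to its frequency in the pool.
    """
    pool = Counter()
    for tokens in ranked_docs[:r]:
        pool.update(tokens)
    return pool


# Toy ranked list (hypothetical data): only the first two documents are used.
ranked = [["query", "expansion", "term"], ["term", "weight"], ["unrelated"]]
candidates = locate_candidate_terms(ranked, r=2)
```

The more sophisticated schemes cited above (passages, past queries, a larger collection) would change only where `ranked_docs` comes from, not the pooling itself.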



2.2 Expansion Term Selection 



The selection of expansion terms is usually performed by ranking candidate terms 
first, and then choosing the highest ranked terms. For ranking expansion terms, a 
number of different methods have been proposed, following two main conceptually 
distinct approaches. One straightforward solution is to rank the candidate expansion 
terms using the (primary) term weights w(t) computed for document ranking ([24], 
[17]). Usually, the score used for inclusion in the expanded query is given by 

\[ score(t) = \sum_{k=1}^{r} w(t)_{doc_k} \] 

where the summation index ranges over the first r retrieved documents. 



This approach is simple and computationally efficient, but it has the disadvantage that 
each term weight reflects the usefulness of that term with respect to the entire 
collection. In order to discriminate between good expansion terms and poor expansion 
terms it seems more convenient to consider occurrence in relevant documents in 
comparison to occurrence in all documents. In other words, one may assume that the 
differences between the distribution of terms in the overall document collection and 
the distribution of the same terms in a set of relevant documents are related to 
semantic factors. It is expected, in particular, that good terms will occur with a higher 
frequency in relevant documents than in the whole collection, and poor terms will 
occur with the same frequency (randomly) in both. An early example of this approach 
is in [11], where a comparative statistical analysis of term occurrences - via a chi- 
square variant - is used to suggest potentially relevant terms for interactive query 
expansion. A more general theoretical argument that supports the use of the 
differences in term distribution to select the terms to be included in the expanded 
query was provided by Robertson [18], who showed that the inclusion of the term t in the 
expanded query will, under certain strong assumptions, increase the retrieval 
effectiveness by $w_t (p_t - q_t)$, where $w_t$ is the primary weight of the term t, and $p_t$ and 
$q_t$ are the probabilities that a relevant and a non-relevant document, respectively, 
contain the term t. In fact, variants of Robertson’s ranking scheme for expansion 
terms have subsequently been used by various systems, with different weighting 
functions and different methods for estimating $p_t$ and $q_t$ ([4], [19], [14]). An 
alternative, more recent, approach to using the differences in term distribution for 
selecting expansion terms relies on the relative entropy, or Kullback-Leibler distance, 
between the two distributions, from which a computationally simple and theoretically 
justified method to assign scores to candidate expansion terms can be derived [7]. 



2.3 Reweighting of Expanded Query 

Most systems that perform retrieval feedback rely on Rocchio’s formula [20], as 
improved by [21], to expand and reweight the query ([24], [23]). In the retrieval 
feedback setting, it is usually assumed that the relevant documents are the r top 
documents retrieved by the systems and that the information about the number of non- 
relevant documents is absent. The simplified formula becomes: 

\[ w(t)_{q_{exp}} = \alpha \times w(t)_q + \beta \times \sum_{k=1}^{r} w(t)_{doc_k} \tag{1} \] 

It should be noted that the simple method for ranking expansion terms illustrated 
above is based on their proposed Rocchio weights. We should also emphasize that 
some modified versions of Rocchio’s formula have recently been proposed that 
showed better performance than basic Rocchio on tasks involving proper relevance 
feedback ([5], [22]). We did not investigate such extensions in our experiments. 



3 Using Distributional Term Selection Within Rocchio’s Weighting 
Scheme 

Having introduced different term-scoring methods, the first goal of our experiments 
was to evaluate the relative performance of these methods on selecting expansion 
terms. This comparison requires caution, because the overall retrieval effectiveness 
may be a compound effect that masks the variables under study. In order to ensure a 
controlled experiment, we varied only the method used for selecting expansion terms 
while keeping the other factors involved in the query expansion process constant. 
Most important, to reweight the query after selection of expansion terms we 
uniformly used Rocchio’s formula reported in expression (1), with α=1, β=1. We 
used as test collection the TREC-7 collection (TREC disks 4 and 5, containing 
approximately 2 Gigabytes of data) and query set (topics 351-400). The underlying 
basic ranking system used in the experiments by all four methods was developed in 
the context of our participation in TREC-7, and thus its data structures were 
specifically designed and implemented to efficiently handle the large TREC test 
collection. The system uses a vector space model with cosine normalization; 
documents and queries are weighted with the classical tfidf scheme, after word 
stopping and stemming. The same test collection and basic ranking system were also 
used in subsequent experiments. 

The five term-ranking functions tested in the experiment were the following (R 
indicates the pseudo-relevant set, C the whole collection, and w(t) is the weight of 
term t in the collection): 

- Rocchio’s weights: $score(t) = \sum_{k=1}^{r} w(t)_{doc_k}$ 

- Robertson Selection Value (RSV)¹: $score(t) = \sum_{k=1}^{r} w(t)_{doc_k} \cdot p_R(t)$ 

- CHI-square (CHI2): $score(t) = [p_R(t) - p_C(t)]^2 / p_C(t)$ 

- Doszkocs’ variant of CHI-square (CHI1): $score(t) = [p_R(t) - p_C(t)] / p_C(t)$ 

- Kullback-Leibler distance (KD): $score(t) = [p_R(t) - p_C(t)] \cdot \log [p_R(t) / p_C(t)]$ 

¹ We assumed, as also done in [19], that the probability that a non-relevant document 
contains the term t is negligible. 

We considered as candidate expansion terms those contained in R. To estimate 
$p_R(t)$, we used the ratio between the frequency of t in R, treated as a long string, and 
the number of terms in R; analogously, to estimate $p_C(t)$, we used the ratio between 
the frequency of t in C and the number of terms in C. The estimation of probabilities 
is an important issue because it might affect performance results. Although we have 
not fully worked out this aspect, we tried also different estimation functions such as 
the number of pseudo-relevant documents that contain the term ([5], [19]), which 
however seemed to produce worse retrieval effectiveness. Finally, all term-ranking 
methods tested in the experiment required two values for practical implementation: 
the number of pseudo-relevant training documents and the number of expansion terms 
considered for inclusion in the expanded query. Consistent with many TREC 
researchers, in our experiment the values of the two thresholds were set at 5 and 30, 
respectively. 
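
As an illustration of the probability estimates and of the three scoring functions that depend only on term distributions (CHI2, CHI1, and KD), the following Python sketch may help. It is a minimal example, not the implementation used in the experiments: it assumes R and C are available as flat token lists, that R is drawn from C (so every candidate term has a non-zero collection probability), and all function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def term_probabilities(tokens):
    """Relative frequency of each term in a token list treated as one long string."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def distributional_scores(relevant_tokens, collection_tokens):
    """Score every term of R with the CHI2, CHI1 and KD functions.

    Assumption: R is drawn from C, so p_C(t) > 0 for every candidate term.
    """
    p_r = term_probabilities(relevant_tokens)
    p_c = term_probabilities(collection_tokens)
    scores = {}
    for term, pr in p_r.items():
        pc = p_c[term]
        diff = pr - pc
        scores[term] = {
            "CHI2": diff ** 2 / pc,
            "CHI1": diff / pc,
            "KD": diff * math.log(pr / pc),
        }
    return scores


def top_terms(scores, method, n=30):
    """Return the n highest-scoring terms for one scoring function."""
    return sorted(scores, key=lambda t: scores[t][method], reverse=True)[:n]


# Toy data (hypothetical): R is part of C.
R = ["jaguar", "car", "car", "engine"]
C = R + ["the"] * 10 + ["car"] * 2 + ["cat"] * 4
scores = distributional_scores(R, C)
best_kd = top_terms(scores, "KD", n=2)
```

In this toy example "car" is the most frequent term in R but also fairly common in C, so the rarer terms "jaguar" and "engine" receive the higher KD scores.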

For each query, we ran the complete ranking system five times, once for each 
technique for selecting expansion terms. In Table 1 we report 
the retrieval performance of each method, averaged over the query set, and show the 
performance improvement over the non-expanded query, used as a baseline. Performance 
was measured with the standard TREC evaluation measures. In Table 1, the 
distributional methods are labeled with an R subscript to indicate that they were 
coupled with Rocchio’s reweighting scheme. Asterisks are used to denote that the 
difference is statistically significant, using a one-tailed paired t test with a confidence 
level in excess of 95%. 
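
For readers unfamiliar with the significance test, the statistic behind a paired t test can be sketched as follows. This is a generic illustration using only the Python standard library, not the procedure actually run for Table 1; the per-query values are made up. For a one-tailed test at the 95% level, the statistic would be compared with the critical value of the t distribution (about 1.68 for the 49 degrees of freedom of a 50-query set).

```python
import math


def paired_t_statistic(expanded, baseline):
    """t statistic of the per-query differences (expanded - baseline).

    Requires at least two queries and a non-zero variance of the differences.
    """
    diffs = [e - b for e, b in zip(expanded, baseline)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n)


# Hypothetical per-query average precision for four queries.
baseline = [0.10, 0.20, 0.30, 0.40]
expanded = [0.12, 0.25, 0.31, 0.46]
t = paired_t_statistic(expanded, baseline)
```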

The results show that expanded queries worked better than non-expanded queries 
for all expansion techniques and for all evaluation measures, with the main exception 
of “Prec-at-10”, although the differences usually were not statistically significant. 
Somewhat unexpectedly, the five expansion methods (Rocchio, RSV_R, CHI-1_R, 
CHI-2_R, and KD_R) obtained very similar average performance improvement over the 
non-expanded query for all evaluation measures. Indeed, one of the most interesting 
findings of this experiment is that as long as we employ Rocchio’s formula for 
reweighting an expanded query, the use of a more sophisticated method for ranking 
expansion terms than Rocchio’s itself does not seem to produce, on average, any 
performance improvement. These results confirm and extend to a slightly different 
setting and a larger database earlier findings about the low importance of selection 
metrics in the performance of relevance feedback systems. 

Table 1. Comparison of mean retrieval performance 





              Non-expanded   ROCCHIO    RSV_R      CHI-2_R    CHI-1_R    KD_R 
RET&REL       38.56          40.62      40.54      40.28      40.10      40.60 
                             +5.34%*    +5.13%*    +4.46%*    +3.99%     +5.29% 
AV-PREC       0.1231         0.1280     0.1277     0.1262     0.1312     0.1279 
                             +3.98%     +3.74%     +2.52%     +6.51%     +3.90% 
11-PT-PREC    0.1502         0.1529     0.1526     0.1518     0.1567     0.1531 
                             +1.84%     +1.64%     +1.07%     +4.33%     +1.93% 
R-PREC        0.1694         0.1773     0.1766     0.1776     0.1824     0.1765 
                             +4.69%*    +4.25%     +4.84%*    +7.66%     +4.19% 
PREC-AT-5     0.3880         0.3920     0.3920     0.3880     0.4040     0.3920 
                             +1.03%     +1.03%     0.00%      +4.12%     +1.03% 
PREC-AT-10    0.3380         0.3360     0.3340     0.3300     0.3380     0.3340 
                             -0.59%     -1.18%     -2.37%     0.00%      -1.18% 



4 Comparing Distributional Reweighting Schemes to Rocchio 

The five term-scoring functions introduced above can be used not only to select the 
expansion terms but also to weight them in expression (1), instead of Rocchio’s 
weights. The overall reweighting function becomes: 

\[ w(t)_{q_{exp}} = \alpha \times w(t)_q + \beta \times score(t) \tag{2} \] 

We compared the effectiveness of the five reweighting methods derived from 
equation (2) to Rocchio’s scheme (expression 1). The values of the several parameters 
needed to implement the four methods were chosen as in the earlier experiment (i.e., 5 
pseudo-relevant documents, 30 expansion terms, α=1, β=1). In Table 2 we report the 
retrieval performance of each method, averaged over the query set, and again show 
the performance improvement over ranking with the non-expanded query, used as a 
baseline. Table 2 shows that the performance of RSV was, on the whole, slightly 
inferior to Rocchio, while the other three distributional methods clearly 
outperformed Rocchio (and RSV). Compared to the baseline, the performance of the 
best three distributional methods was still comparable when we considered a very 
limited number of retrieved documents (i.e., for “Prec-at-5” and “Prec-at-10”), but it 
dramatically improved for all other evaluation measures, with statistically significant 
differences. 
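
The reweighting step of this experiment can be illustrated with a short Python sketch. It is a minimal reading of the scheme, not the authors' code: the expanded weight of a term is taken as α times its original query weight plus β times its expansion score, and the sketch assumes that only the top-ranked candidate terms receive the score contribution. The function name, parameter names, and toy values are hypothetical.

```python
def reweight_query(query_weights, term_scores, alpha=1.0, beta=1.0, n_terms=30):
    """Expanded-query weights: alpha * w(t)_q + beta * score(t).

    query_weights: dict term -> weight in the original query.
    term_scores:   dict candidate term -> score from a term-scoring function.
    Assumption: only the n_terms best-scoring candidates are added or boosted.
    """
    selected = sorted(term_scores, key=term_scores.get, reverse=True)[:n_terms]
    expanded = {term: alpha * w for term, w in query_weights.items()}
    for term in selected:
        expanded[term] = expanded.get(term, 0.0) + beta * term_scores[term]
    return expanded


# Toy example (hypothetical weights and scores), keeping two expansion terms.
query = {"jaguar": 0.5}
candidate_scores = {"jaguar": 0.3, "car": 0.2, "the": 0.01}
expanded_query = reweight_query(query, candidate_scores, n_terms=2)
```

Because the same scores drive both selection and weighting, a term ranked ahead of another always receives the larger score contribution, which is the property discussed in the following paragraphs.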

Thus, the main result of this experiment is that when a distributional method for 
term selection is also used for query reweighting, the overall retrieval effectiveness 
may considerably improve. Although this finding should not be over-generalized, 
because it was obtained for a specific combination of the parameters involved in the 
weighting schemes, it suggests that if we have a good method for ranking 
expansion terms we should try to use it also for assigning weights to terms in the 
expanded query. The main rationale for this is that if one expansion term, for a given 
query, is correctly ranked ahead of another then it should receive a proportionally 
higher weight in the expanded query, while if we use for query reweighting a 
weighting scheme that computes an absolute value of term goodness ignoring the 
specific information associated with the query at hand, like Rocchio’s formula, then 
the better term might receive a lower weight than the worse term. The low 
performance of RSV is consistent with this observation, because the RSV score is 
more of a variant of Rocchio than a distinct reweighting function based on the 
differences in term distribution. In the rest of the paper we concentrate on the three 
fully-distributional methods (i.e., CHI2, CHI1, and KD). 

Table 2 also shows that the three fully-distributional methods achieved more 
comparable retrieval performance with respect to one another. As these methods use 
different mathematical functions, we hypothesized that despite their similar mean 
effectiveness they would present considerable variation on individual queries. 
Therefore we decided to test this hypothesis through a query by query analysis. 



Table 2. Comparison of mean retrieval performance 

              Non-expanded   ROCCHIO    RSV        CHI-2      CHI-1      KD 
RET&REL       38.56          40.62      41.56      43.38      42.50      43.16 
                             +5.34%     +7.78%*    +12.50%*   +10.22%*   +11.93% 
AV-PREC       0.1231         0.1280     0.1243     0.1466     0.1471     0.1409 
                             +3.93%     +0.94%     +19.05%*   +19.46%*   +14.39%* 
11-PT-PREC    0.1502         0.1529     0.1469     0.1695     0.1720     0.1644 
                             +1.84%     -2.16%     +12.87%*   +14.53%*   +9.46% 
R-PREC        0.1694         0.1773     0.1683     0.1912     0.1970     0.1840 
                             +4.69%*    -0.61%     +12.87%*   +16.30%*   +8.63%* 
PREC-AT-5     0.3880         0.3920     0.3640     0.3800     0.4000     0.3840 
                             +1.03%     -6.19%*    -2.06%     +3.09%     -1.03% 
PREC-AT-10    0.3380         0.3360     0.3320     0.3520     0.3620     0.3400 
                             -0.59%     -1.78%     +4.14%     +7.10%*    +0.59% 



5 Performance Variation of Distributional Methods on Individual 
Queries 

Xu and Croft [25] used the overlap between the sets of suggested terms to compare 
the performance of different query expansion methods on single queries. We observed 
that in our case the use of such a simple evaluation measure would not help much in 
disclosing the different behavior of the three methods, due to their relatively high 
overlap. Thus, we used a more powerful measure related not only to which terms are 
suggested by each method but also to how those terms are ranked. We also measured 
the relative retrieval effectiveness of the terms suggested by each method on single 
queries. 

Variations on term ranking. For each query and for each pair of methods we 
computed a measure of the difference between term rankings, considering only the 
first 30 terms suggested by each method. In particular, for each term in the ranked 
term list produced by one of the two methods, we computed the distance between the 
positions of that term in the ranked lists produced by the two methods; if the term was 
not contained in the second list we assumed that it was ranked right after the 
last-ranked term (i.e., as 31st). We then averaged over the set of terms suggested for each 
query and over the set of queries. It should be noted that this measure is asymmetrical, 
because the result depends on which method is selected first. The results are 
shown in Table 3; the first method to which each pairwise comparison refers is shown 
on columns. Considering that we used only the first 30 terms and that we assumed 
that all other terms were equally ranked as 31st, the most important finding is that the 
variation was substantial for any pair of methods. In addition, the results show 
that the variation between KD and each of the other two methods was larger than that 
between CHI2 and CHI1, which are more similar in nature. 
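
The per-query distance described above can be written down compactly. The following Python sketch is one possible reading of the measure for a single query and a single ordered pair of methods; averaging over queries is left out, and the function name and the toy lists are hypothetical.

```python
def mean_rank_distance(first, second, n=30):
    """Mean positional distance of the first method's top-n terms in the second ranking.

    first, second: ranked term lists, best term first.
    A term missing from the second top-n list is treated as ranked n + 1
    (31st when n = 30), as described in the text.
    """
    first, second = first[:n], second[:n]
    position = {term: i + 1 for i, term in enumerate(second)}
    missing = n + 1
    distances = [abs((i + 1) - position.get(term, missing))
                 for i, term in enumerate(first)]
    return sum(distances) / len(distances)


# Toy rankings (hypothetical) showing that the measure is asymmetrical.
a = ["x", "y", "z"]
b = ["z", "w", "v"]
d_ab = mean_rank_distance(a, b, n=3)  # distances 3, 2, 2
d_ba = mean_rank_distance(b, a, n=3)  # distances 2, 2, 1
```

The two directions differ because each one averages over the terms of its own first list, which is why the measure depends on which method is taken first.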

Table 3. Mean term distance in pairwise term-ranking comparison (restricted to the first 30 
terms). 

         CHI2     CHI1     KD 
CHI2     0.00     6.29     11.58 
CHI1     5.21     0.00     13.85 
KD       11.00    13.77    0.00 



Variation on retrieval effectiveness. For each query and for each expansion 
method, we measured the difference between the average precision obtained with 
expanded query and that obtained with non-expanded query. In Figure 1 we show for 
each query the minimum and maximum of such differences; thus, the length of each 
bar depicts the range of performance variations attainable by the three methods on 
each query. For most queries, the variations with respect to non-expanded query (x 
axis) were either all positive or all negative, as might be expected, although there 
were also a significant number of exceptions; most important, despite showing similar 
mean performance over the query set (see Table 2) but consistently with the term 
distance analysis, the inter-method variations on single queries were ample, with a 
mean value of 50.3%. Thus, methods which generated better terms on some queries 
produced poorer terms on others. The fact that individual methods disagreed with one 
another on individual queries while predicting, on average, equally good terms 
suggests trying combination strategies with the aim of retaining, on average, the most 
informative terms. This issue is discussed below. 

Fig. 1. Performance variation of query expansion methods on individual queries. 



6 Combining Multiple Distributional Query Expansion Methods 

Recent research in machine learning and information retrieval has shown that 
ensembling multiple classifiers, whether produced by single or different learning 
algorithms, may be a viable technique for improving classification accuracy ([16], 
[3], [10]). Two keys to success are that the individual classifiers must disagree with 
one another and that their average accuracies must be comparable. In this case one 
can try to guess the right prediction by taking a majority vote, in the hope that the 
single classifiers make uncorrelated errors. 

In the retrieval feedback setting, the output of each method is represented by a 
ranked list of new terms instead of a sharp yes/no procedure as in concept 
classification. By analogy with ensembling classifiers, we can hypothesize that the 
individual methods make uncorrelated errors in suggesting new terms, i.e., when a 
term erroneously gets a high rank in one method the same term gets a low rank in the 
other methods, so that a majority procedure can correctly rank the term. As described 
above, we successfully checked for mean performance and diversity of the individual 
retrieval feedback techniques, so the next issue is how to ensemble the results of the 
individual methods. One simple approach is to compute an average score for each 
term from the scores assigned to that term by individual methods and then use these 
new scores. This approach, however, would require that the scores produced by each 
method have similar absolute values, otherwise the average scores will be 
dominated by the method with the highest scores. This condition was not met by the retrieval 
feedback methods tested in our experiment, because the KD scores were 
comparatively higher than the other methods’ scores. Therefore we took an alternative 
approach. 

As the individual methods presented quite large variations on the order in which 
terms were ranked, we decided to focus on the differences between the relative 
position of each term in the three rankings, ignoring the term scores. Thus, the ranks 
of the terms were averaged and the mean was used to rerank them. Once the ranks 
have been merged, the relevance score of the terms can be computed by using some 
inverse function of their final position. We used as a new term-scoring function the 
simple ratio between 1 and the position of the term; i.e., 1 for the first term, 1/2 for 
the second term, 1/3 for the third term, etc. The scores obtained this way were used in 
equation (2) to assign weights to the new terms, and the resulting combined 
reweighting method was tested for performance using the same parameter setting as 
in the previous experiments with individual methods. 
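The rank-merging step described above can be sketched as follows. This is a minimal illustration, not the code used in the experiment: the function name, the example terms, the requirement that every method ranks the same candidate terms, and the alphabetical tie-breaking rule are all assumptions introduced here.

```python
def combine_rankings(rankings):
    """Merge ranked term lists by mean rank; score each term as 1 / final position."""
    terms = set(rankings[0])
    # Assumption: every method ranks the same candidate terms exactly once.
    if any(set(r) != terms or len(r) != len(terms) for r in rankings):
        raise ValueError("each ranking must contain the same terms exactly once")
    # Average the 1-based positions of each term over all rankings.
    mean_rank = {
        term: sum(r.index(term) + 1 for r in rankings) / len(rankings)
        for term in terms
    }
    # Assumption: ties in mean rank are broken alphabetically.
    merged = sorted(terms, key=lambda term: (mean_rank[term], term))
    # New term score: 1 for the first term, 1/2 for the second, 1/3 for the third, ...
    return [(term, 1.0 / position) for position, term in enumerate(merged, start=1)]


# Hypothetical rankings from three methods (terms are illustrative only).
rankings = [
    ["library", "digital", "archive", "index"],
    ["digital", "library", "index", "archive"],
    ["library", "archive", "digital", "index"],
]
combined = combine_rankings(rankings)
for term, score in combined:
    print(f"{term}: {score:.3f}")
```

Because only positions are used, a method whose raw scores are much larger than the others' cannot dominate the merged ranking.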

The results are shown in Table 4, again with improvement over ranking with the 
non-expanded query used as a baseline. A comparison between Table 4 and Table 2 shows 
that the combined method had better performance than any individual method for 
almost every evaluation measure, thus further improving the performance over the 
non-expanded query. In particular, the improvement in average precision is 
especially notable (+21.34%), since this is the most informative evaluation measure of 
ranking performance. The results shown in Table 4 and Table 2 also indicate that the 
performance scores of the combined method represented, in general, a small 
improvement over the scores obtained by the best individual method. However, as 
combination strategies work best when the results being combined are generated 
independently [15], there are reasons to believe that such an improvement could be 
higher if we weakened some experimental parameters that are likely to increase the 
correlation between the term-relevance estimates of the individual methods (e.g., 
varying the document representation and the set of training data). Furthermore, the 
merging of several term ranks can be performed using more sophisticated techniques 
involving linear combination of individual ranks and parameter optimization, similar 
to work on combining multiple ranked document lists [2]. 

The results of this experiment should be taken with caution and cannot be easily 
generalized without further evidence, because they were obtained for specific 
ensembling methods and parameter combinations; nonetheless, since we used very 
simple and untuned functions, they represent an indication that this approach is 
feasible. 

Table 4. Mean retrieval performance of combined method 

REL-RET DOCS   AV-PREC    11-PT-PREC   R-PREC     PREC-AT-5   PREC-AT-10 
45.72          0.1494     0.1733       0.1930     0.3920      0.3560 
+18.57%*       +21.34%*   +15.38%*     +13.96%*   +1.03%      +5.33% 



7 Effect of Method Parameters on Performance 

As most approaches to automatic query expansion, including ensembling methods, 
rely on a number of parameters, it is important to study how these parameters affect 
performance. One of the key factors for success is the quality of the initial retrieval run. 
In particular, one might expect that query expansion will work well if the top 
retrieved documents are good and that it will perform badly if they are poor. Xu and 
Croft [25], for instance, found that pseudo-relevance feedback tends to hurt queries 
with baseline average precision less than 5%. To test this hypothesis more thoroughly, we 
studied how the retrieval effectiveness of the combined method varied as the 
difficulty of a query changed, where difficulty was characterized by the average 
precision of the initial run for the given query (the lower the average precision, 
the greater the difficulty). The results are shown in Figure 2. Each circle represents 
one of the 50 queries; if the circle is above (below) the bisecting line, then the 
performance increased (decreased) when we passed from non-expanded to expanded 
query. The query difficulty decreases as we move away from the origin. 

These results are somewhat unexpected, because no clear pattern seems to emerge. 
The performance improvement does not grow monotonically with the easiness of the query; 
indeed, if we split the x axis into intervals and compute the average performance of the 
queries within each interval, then it is easy to see that the performance variation is 
initially negative, as expected, and then climbs until it reaches a maximum 
(initial precision of 20-30%), after which it declines and may drop below zero again. 
In fact, our experiment supports the view that queries with low initial precision do not 
carry useful information for improvement, while queries with high initial precision can 
hardly be improved further; to achieve further mean improvement, 
one might develop selective policies for query expansion that focus on queries that are 
neither too difficult nor too easy. 
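The interval analysis mentioned above (splitting the x axis into intervals and averaging the performance variation within each) can be sketched as follows. The function name, the interval width of 0.1, and the per-query values are hypothetical and are not taken from the experiment.

```python
from collections import defaultdict


def mean_change_by_interval(initial_ap, expanded_ap, width=0.1):
    """Mean change in average precision, grouped by initial-precision interval."""
    changes = defaultdict(list)
    for before, after in zip(initial_ap, expanded_ap):
        # round() guards against floating-point error at interval boundaries.
        interval = int(round(before / width, 9))
        changes[interval].append(after - before)
    # Key each result by the (lower, upper) bounds of its interval.
    return {
        (round(i * width, 9), round((i + 1) * width, 9)): sum(v) / len(v)
        for i, v in sorted(changes.items())
    }


# Hypothetical per-query average precision before and after expansion.
initial = [0.02, 0.04, 0.22, 0.28, 0.65]
expanded = [0.01, 0.03, 0.30, 0.34, 0.64]
summary = mean_change_by_interval(initial, expanded)
for bounds, change in summary.items():
    print(bounds, f"{change:+.3f}")
```

A selective expansion policy of the kind suggested above could then expand only those queries whose initial precision falls in intervals with a positive mean change.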

Two other main parameters of automatic query expansion systems are the number 
of pseudo-relevant documents used to collect expansion terms and the number of 
terms selected for query expansion. We performed some experiments to see how the 
retrieval performance varied as a function of these two parameters. Let us consider 
first the number of documents. On the grounds that the density of relevant 
documents is higher for the top-ranked documents, one might think that the fewer the 
documents considered for expansion, the better the retrieval performance. 
However, this was not the case. The retrieval performance was found to increase as 
the number of documents increased, at least for a small number of documents, and 
then it gradually dropped as more documents were selected. This behavior can be 
explained by considering that the percentage of truly relevant documents among the 
pseudo-relevant documents is not the only factor affecting performance here. If we select a 
very small number of pseudo-relevant documents, it is more likely that we will get, 
for some queries, no relevant document at all, which may produce very bad results on 
those queries and a mean performance degradation. Thus, the optimal choice should 
represent a compromise between the maximization of the percentage of relevant 
documents and the presence of at least some relevant document. Consistent with 
the results reported above, we found that these two factors were best balanced 
when the size of the training set ranged from 4 to 12; for smaller sizes the number of 
queries with no relevant documents was proportionally higher, while for larger sizes the 
percentage of nonrelevant documents grew large. 




Fig. 2. Improvement versus initial query difficulty 



The results concerning the variation of the retrieval performance with the number 
of expansion terms were more predictable. We found that the performance 
improvement initially increased as more terms were selected, at least as long as we 
selected truly new terms (consider that the first suggested terms usually coincide with 
the original query terms), and then it gradually decreased as more and more 
less-informative terms were chosen. 



8 Conclusions 

This paper extended earlier results about the effectiveness of automatic query 
expansion techniques in several directions. In particular, from our experimental 
evaluation, three main conclusions can be drawn. 

• Term-scoring methods based on distribution analysis are not likely to improve 
performance when they are used only to select expansion terms, but the same 
methods may produce a considerable performance improvement when they are 
used to both select and reweight the expansion terms. 

• The combination of the set of expansion terms produced by different distributional 
methods may perform better than the individual methods. 

• The retrieval performance of automatic query expansion usually increases as the 
query difficulty decreases, but it may decrease as the query becomes very easy. 
Similarly, the optimal number of pseudo-relevant documents and expansion terms 
should represent a compromise between using little new information and much 
new information. 

While we mainly focused on term selection and term reweighting, there are also 
other aspects of the proposed approach to query expansion that need to be evaluated 
more carefully, such as robustness of probability estimation and combination of 
multiple results. Aside from experimental investigations, we need a better theoretical 
understanding of the relative strengths and weaknesses of the individual query 
expansion techniques and of why their combination may work well. Also, having 
ascertained the importance of term reweighting over term selection in the good 
performance of distributional methods in a pseudo-relevance feedback task, it is 
tempting to see if these methods can be used as primary term-weighting schemes, in a 
proper relevance feedback environment. Finally, our approach could be used to 
generate good search terms not only for automatic query expansion but also in 
interactive searches, with the aim of helping users to expand or refine a query based on 
the actual content of the collection. We are currently investigating these issues. 



Abstract. Resource discovery in a distributed digital library poses many 
challenges, one of which is how to choose search engines for query distribution, 
given a query and a set of search engines. This paper focuses on search engine 
performance as a criterion for search engine selection and defines two 
measurements of search engine performance: availability - will the search 
engine respond within a time limit, and response time - how quickly will the 
search engine respond, given that it responds at all. We predicted both of these 
performance characteristics with a variety of algorithms, all of which required 
little computation time and combined past performance data for each search 
engine into a succinct record. We used operational data from the NCSTRL 
distributed digital library to make and evaluate predictions, and we found that 
simple prediction methods performed as well as more complex methods and 
that prediction accuracy was closely related to data consistency. 



1 Introduction 

It has been said that the Internet, and the wide range of products and services made 
available there, has created a culture of instant gratification among networked 
computer users. In particular, Internet users expect swift and accurate responses to 
their search requests, but it can be difficult to fulfill these expectations in the rapidly 
expanding World Wide Web. Virtually all Web resource discovery tools are based on 
a centralized architecture, in which a central service creates and deploys a master 
index, possibly replicated for localized network access. While the utility of this 
architecture has been proven, there are inherent constraints to the centralized 
approach, including scalability, lack of domain specificity and intellectual property 
restrictions [15]. 

One approach offering promising solutions to these problems is distributed 
searching, in which query processing is distributed among a set of decentralized 
search engines. Rights management issues can be addressed via licensing agreements 
pertaining to specific search engines or sets of documents. Individual search engines 
can cater to the unique needs of a collection or a user community. Also, queries can 
be processed in parallel, reducing scalability issues. 

S. Abiteboul, A.-M. Vercoustre (Eds.): ECDL ‘99, LNCS 1696, pp. 142-166, 1999 
© Springer-Verlag Berlin Heidelberg 1999 

A key question for distributed searching is: given a query and a set of search 
engines, which ones should be selected? Selection can be based on information 
content of the search engines and other factors including load, cost, licensing 
agreements, network latency and server reliability. These selection criteria require 
sophisticated mechanisms to work properly and reliably. For example, there may be 
tradeoffs between selection criteria, or search engines may have disparate methods for 
determining whether or not to grant access privileges. Once the full set of useful 
servers has been identified, we would like to select the most reliable and fastest of 
them to process our query. 

In this paper, we focus on search engine performance with respect to query 
distribution: in particular, we tried to predict search engine performance. Accurate 
performance predictions could be used to choose among search engines that index the 
same information, thereby reducing the time a user waits for search results. 

The paper is structured as follows. Section 2 describes the logical components of 
distributed searching architecture and our approach to the query distribution problem. 
Section 3 describes the algorithms used to make predictions of search engine 
performance. Section 4 presents the efficacy of our predictions regarding whether or 
not a search engine will respond before a time limit. Section 5 presents the efficacy 
of our predictions regarding how quickly a search engine will respond, given that it 
responds before the time limit. Concluding remarks are presented in section 6. 



2 An Approach to Distributed Searching 

We have been investigating distributed resource discovery issues in the broader 
context of federated digital library architecture [18]. This architecture builds digital 
libraries from distinct sets of individual services, each with defined functionality and 
the ability to communicate with one another using defined 
protocols. The advantages of this architecture include scalability, easy extension of 
functionality, and its support of semi-autonomous management of distributed 
services: participants retain autonomy over their components while using the common 
protocol to communicate with other services in the digital library. 

Three of the services in this model are: user interfaces, performing digital library 
functions pertaining directly to user interaction; repositories, which store and access 
digital documents; and indexers, which index information (metadata or full text) for 
digital documents and process queries on that information. 

The information indexed at a particular indexer may be a replica of that at other 
indexers, may be completely disjoint, or it may overlap the information at other 
indexers in various ways. How a federated digital library apportions information 
among indexers depends on a mixture of administrative decisions, rights management 
issues and fault tolerance concerns. This creates a need for a mechanism to select 
indexers for query distribution based on content (choosing indexers that have 
information relevant to the query) as well as other factors such as cost or performance 
(choosing among multiple indexers indexing the same information). 

Other researchers have investigated a variety of issues relevant to distributed 
searching. For example, the distributed database community has a long history of 
investigating the optimal distribution of indexing information across LANs and 
controlled WANs [5]. Research areas in the digital library community include query 
translation [4], content summarization for query routing [14] [3], content routing [10] 
[12], collection formation via sharing of metadata and index information [19] and 
protocols for meta-searching and metadata collection [1] [13]. In addition there have 
been a number of comparative studies of database selection algorithms [11] [9] and 
performance evaluations of distributed architectures [2]. 

Our research focuses on the selection of indexers with respect to the performance 
of networks and of the indexers themselves. In [8], we introduced the query mediator 
as a digital library service which functions as an intermediary between user interfaces 
(UIs) and indexers. Specifically, query mediators (QMs) are responsible for 
translating queries to indexer protocols, choosing indexers for query processing, 
routing queries to those indexers, adaptively reacting to operational conditions, and 
aggregating query results. If a QM performs well, it will rapidly deliver complete 
search results to the UI. If a QM chooses indexers poorly, or makes poor adaptations 
to operational conditions, then the digital library user could receive slow or 
incomplete search results. 

In [7], we found that on average QMs spend 44-54% of their time waiting for 
indexers to respond. We would like to improve the QM mechanism for choosing 
among overlapped or replicated indexers in order to reduce wait time for indexers, but 
determining which indexers fit that description is difficult. Indexer reliability and 
processing speed depend on factors such as indexer hardware characteristics, size of 
the indexes and current CPU load. In [8], we showed that an indexer’s performance 
does not appear the same to a QM as it appears to the indexer itself. From the 
perspective of a QM, or the QM-view, indexer performance depends on additional 
factors that are hard to predict, such as network connectivity and network loads. If we 
could accurately predict QM-view indexer performance, then we could improve the 
QM’s choice of indexers, reducing QM wait time and thus, user wait time. 

Our predictions focused on two key aspects of indexer performance from the QM- 
view: 

• Availability: will the indexer respond within a time limit (such as a search 
timeout)? Whether or not an indexer responds to a QM is dependent on whether 
the indexer is currently running and on the network, how long the QM listens 
for a response, in addition to network conditions, indexer CPU load, etc. If we 
could accurately predict QM-view indexer availability, then QMs wouldn’t 
waste time waiting for responses from indexers that will never respond. 

• Response time: how quickly will the indexer respond, given that it responds at 
all? This is the elapsed time between the moment the query is sent from the 
QM and the moment results are received at the QM from the indexer. If we 
could accurately predict QM-view indexer response time, then the QM could 
direct queries to the fastest indexers, and could choose judicious search timeouts 
as well. 



2.1 NCSTRL - A Distributed Digital Library Testbed 

The distributed digital library on which we base our research is the Networked 
Computer Science Technical Reference Library¹ (NCSTRL - pronounced 
“ancestral”). NCSTRL is an operational digital library employing a distributed, 
component-based architecture. The NCSTRL collection is globally distributed and 
made available through the Dienst [16] federated digital library architecture. Dienst is 
an open architecture and protocol [6] for distributed digital libraries that was 
developed as part of the DARPA-funded Computer Science Technical Reports 
Project². These characteristics - global distribution, open interface, and production 
availability - make NCSTRL an ideal testbed for distributed digital library research 
(indeed, NCSTRL is one of the collections in the DARPA-funded Distributed 
Integration Testbed³). 

The NCSTRL collection consists of institutions, or publishing organizations, each 
of which (at a minimum) provides a repository of digital documents and descriptive 
metadata [17] for those documents. These institutions are a combination of Ph.D. 
granting computer science departments, ePrint repositories, electronic journals, and 
research institutions. At the time of publication of this paper, there were over 100 
NCSTRL repositories and approximately 50 NCSTRL indexers worldwide. 

The Dienst architecture specifies the operational characteristics of semi- 
autonomous core digital library services, as well as describing an open, extensible 
protocol for communicating among and with these digital library services. Core 
services include repositories, indexers and user interface gateways, as well as 
collection services which provide the mechanisms for federating these and other 
services into a digital library. While not formally defined in Dienst as a separate 
digital library service, the functionality of the QM is present in NCSTRL. 

Each Dienst server, which implements and provides protocol access to a set of 
services, maintains logs containing operational and statistics messages. We obtained 
QM-view indexer performance data by analyzing Dienst logs from the following five 
NCSTRL servers for the period from March 1, 1997 through April 30, 1997: 

1. NCSTRL - the home page of NCSTRL, located at Cornell University. 

2. CS-TR - the Cornell University Department of Computer Science Dienst server. 

3. LITE - the Dienst server at the University of Virginia. 

4. BERKELEY - the University of California at Berkeley Dienst server. 

5. FORTH - the Institute of Computer Science, Foundation for Research and 
Technology - Hellas (ICS-FORTH) Dienst server. 

Full details of how the logs were analyzed can be found in [7]. 



¹ http://www.ncstrl.org 
² http://www.cnri.reston.va.us/cstr.html 
³ http://www.cnri.reston.va.us/integration-testbed.html 



3 Algorithms for Predicting Indexer Performance 

As mentioned above, our goal is to improve the QM mechanism for choosing among 
overlapped or replicated indexers. If we could accurately predict indexer 
performance from the QM-view, then we could optimize the choice of indexers made 
by the QM, reducing QM wait time and thus, user wait time. QM-view indexer 
predictions need to address the following questions: 

1. Availability - will the indexer respond before the search timeout? 

2. Response time - how quickly will the indexer respond, given that it responds at all? 

Answering these predictive questions accurately would be easy if the indexer 
performance data followed an observable pattern. Unfortunately, we were unable to 
discern clear overall patterns for indexer availability or response time in our data. 
Since we couldn’t perceive patterns in the data, we applied a variety of predictive 
methods to QM-view indexer data. All methods we used combined past performance 
data for a given indexer at a particular QM into a succinct record and required little 
QM processing time. 

Two of the predictive methods we used averaged previous observations: 

Running average. The prediction is the average of all previous observations. A 
running average can also be limited to the k most recent observations; we refer to this as 
a window of size k. In this paper, we are using the running average with the 
maximum window size: k is always as large as possible. 

Single last observation. The last observation is used as a prediction. This is 
equivalent to a running average with a window size of one. 
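The two averaging methods can be sketched with a single class, since the single last observation is the window-of-one case. This is an illustrative sketch only; the class and method names are assumptions and do not come from the Dienst or NCSTRL software.

```python
from collections import deque


class RunningAverage:
    """Predict the mean of the last `window` observations (all of them if None)."""

    def __init__(self, window=None):
        # deque(maxlen=None) is unbounded, i.e. the maximum window size.
        self._observations = deque(maxlen=window)

    def update(self, observation):
        self._observations.append(observation)

    def predict(self):
        if not self._observations:
            raise ValueError("no observations recorded yet")
        return sum(self._observations) / len(self._observations)


running = RunningAverage()       # maximum window size
last = RunningAverage(window=1)  # single last observation
for response_time in [3, 6, 7, 11]:  # illustrative response times in seconds
    running.update(response_time)
    last.update(response_time)

print(running.predict())  # mean of all four observations
print(last.predict())     # most recent observation only
```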

Our other predictive methods decayed old observations, as we believed that more 
recent data would be a better predictor of indexer behavior than older data. 

Low pass filter. The prediction is the average recent behavior of an indexer. Old 
observations are decayed exponentially with the following low pass filter formula: 

$$V_n = m \cdot V_{n-1} + (1 - m) \cdot X \qquad (1)$$

$V_n$ is the new value of the low pass filter (as well as the prediction) 
$V_{n-1}$ is the old value in the filter (before the most recent observation) 
$X$ is the most recent observation 
$m$ is a weighting parameter between 0 and 1. 

If m = 0, then the low pass filter is the same as the most recent observation - its 
predictions would be the same as the single last observation method. If m = 1, then 
the filter never changes - all predictions would be the initial filter value, assigned 
before the first observation. 

Ideally, we want to optimize m so that the low pass filter predictions are as accurate 
as possible. We used m = 0.95 for our low pass filter, based on the work of Vingralek, 
Breitbart, Sayal and Scheuermann in [20]. 
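Equation 1 amounts to a one-line update, sketched below. The class name and the sample values are illustrative assumptions; only the update rule and the default m = 0.95 are taken from the text.

```python
class LowPassFilter:
    """Exponentially decayed average of observations (Equation 1)."""

    def __init__(self, initial_value, m=0.95):
        if not 0.0 <= m <= 1.0:
            raise ValueError("m must lie between 0 and 1")
        self.value = initial_value
        self.m = m

    def update(self, observation):
        # V_n = m * V_{n-1} + (1 - m) * X
        self.value = self.m * self.value + (1.0 - self.m) * observation
        return self.value

    def predict(self):
        # The filter value itself is the prediction.
        return self.value


# The two limiting cases discussed above.
follows_last = LowPassFilter(initial_value=5.0, m=0.0)
never_changes = LowPassFilter(initial_value=5.0, m=1.0)
follows_last.update(11.0)
never_changes.update(11.0)
print(follows_last.predict(), never_changes.predict())
```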

While a low pass filter weights recent data more heavily than older data, it has a 
flaw: the weight on each observation decreases exponentially by the number of 
observations, rather than by the time elapsed between observations. We addressed 
this problem with timed low pass filters. 



3.1 The Timed Low Pass Filter 

Since queries are not routed to indexers at regular intervals, we wanted a prediction 
formula that weighted previous observations according to the freshness of the data: if 
the most recent observation was very old, we wanted to lessen its weight, and if it was 
very recent, we wanted to weight it heavily. 

Like a low pass filter, a timed low pass filter gives an average of recent indexer 
behavior with old observations decayed exponentially, except the weighting depends 
on the time elapsed between filter updates. Timed low pass filters are updated via the 
following timed low pass filter update formula: 

$$V_n = m^D \cdot V_{n-1} + (1 - m^D) \cdot X \qquad (2)$$

$V_n$ is the new value of the timed low pass filter (but not the prediction) 
$V_{n-1}$, $X$, and $m$ are as defined for Equation 1 
$D$ is the elapsed time since the last filter update. 

Again, if m = 0, then the timed low pass filter is the same as the most recent 
observation. If m = 1, then the value in the filter never changes. 

As D increases, $m^D$ decreases. So as time passes, the old filter value contributes 
less to the new filter value - the filter memory decreases. 

Note that Equation 2 is for updating the timed low pass filter, not for making 
predictions. That’s because the prediction time is not the same as the filter update 
time. When a QM receives a response from an indexer, it applies Equation 2 to 
update the value in the timed low pass filter. However, when a QM is choosing among 
overlapped or replicated indexers, it applies one of the formulas in Equation 3 to 
predict indexer performance using the timed low pass filter. 

Method A: $P = m^D \cdot Q + (1 - m^D) \cdot V$ (3) 

Method B: $P = m^D \cdot V + (1 - m^D) \cdot Q$ 

$P$ is the prediction 
$V$ is the timed low pass filter value 
$Q$ is some predictive value, such as an initial prediction or the running average 
$m$ is a weighting parameter between 0 and 1 which we determine empirically 
$D$ is the elapsed time since the timed low pass filter was last updated. 

Our choice of m for the timed low pass filter was decided empirically and is 
illustrated in section 3.3 for our example. 
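The update rule of Equation 2 and the two prediction rules of Equation 3 can be sketched together as follows. The class, its method names, and the sample numbers are assumptions made for illustration; times are in seconds and the choice m = 0.5 in the usage lines is only there to keep the arithmetic easy to follow.

```python
class TimedLowPassFilter:
    """Low pass filter whose decay depends on elapsed time (Equations 2 and 3)."""

    def __init__(self, initial_value, m=0.95, start_time=0.0):
        if not 0.0 <= m <= 1.0:
            raise ValueError("m must lie between 0 and 1")
        self.value = initial_value
        self.m = m
        self.last_update = start_time

    def _weight(self, now):
        # m^D, where D is the time elapsed since the last filter update.
        return self.m ** (now - self.last_update)

    def update(self, observation, now):
        # Equation 2: V_n = m^D * V_{n-1} + (1 - m^D) * X
        w = self._weight(now)
        self.value = w * self.value + (1.0 - w) * observation
        self.last_update = now
        return self.value

    def predict_a(self, q, now):
        # Method A: P = m^D * Q + (1 - m^D) * V
        w = self._weight(now)
        return w * q + (1.0 - w) * self.value

    def predict_b(self, q, now):
        # Method B: P = m^D * V + (1 - m^D) * Q
        w = self._weight(now)
        return w * self.value + (1.0 - w) * q


tlpf = TimedLowPassFilter(initial_value=10.0, m=0.5)
tlpf.update(2.0, now=1.0)              # D = 1, so m^D = 0.5 and the value becomes 6.0
print(tlpf.predict_a(4.0, now=3.0))    # D = 2, so m^D = 0.25
print(tlpf.predict_b(4.0, now=3.0))
```

Neither prediction method modifies the filter state; only `update` does, which mirrors the distinction between filter update time and prediction time.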

Equation 3 has two prediction formulas, method A and method B, because we were 
unsure whether the timed low pass filter value or some other value was a better 
approximation of the “most recent” value when making a timed low pass filter 
prediction. That is, method B is very similar to Equation 2: V in method B is 
analogous to $V_{n-1}$ in Equation 2 and Q in method B is analogous to X in Equation 2, 
the “most recent” observation. Method A is the same as method B with Q and V 
swapped - in method A, we view V as the “most recent” observation. 



3.2 Making Predictions - an Example 

As we indicated above, we are interested in predicting indexer behavior from the 
perspective of a QM by using past QM-view indexer performance data; these 
predictions would inform QM indexer choices when processing queries. Our 
approach is to simulate predictions for the indexer performance data we gathered 
from NCSTRL logs in [7] in order to assess which predictive methods are most 
effective. 



Table 1. Example of QM-view indexer response time data 

obs time (in seconds elapsed since time 0)      13  17  19  28  39  45  49  51 
obs value (indexer response time in seconds)     3   6   7  11   5   7   3   5 



Table 1 contains example indexer response times for a particular indexer from the 
view of a particular QM. Note that we have the response times of indexers and when 
these response times were recorded at the QM, but we must extrapolate the following 
information if we are going to simulate predictions for the observed data: 

• Time of predictions. The QM needs to make predictions before it chooses 
indexers and sends them queries, so predictions must occur not only before the 
observed data is recorded, but before the query was sent to the indexer. In the 
case of the timed low pass filter predictions, the time of the prediction affects the 
predicted value. We approximated the time of prediction as follows: 

$$T_p = T_o - R - x \qquad (4)$$

$T_p$ is the time of prediction (in seconds elapsed since time 0) 
$T_o$ is the time of the observation (in seconds elapsed since time 0) 
$R$ is the indexer response time (in seconds) 
$x$ is some constant representing the overhead time for the QM to compute 
predictions, choose indexers, and send the queries to the indexers (in 
seconds). 

For example, for the first observation in Table 1, $T_o$ is 13 and R is 3. If we let 
x = 0.7, then the time of prediction for the first observation is (13 - 3 - 0.7), or 
9.3. 

• Time of data structure updates. We assumed the QM updated data structures 
soon after it recorded indexer performance data; we used a constant to represent 
the amount of time the QM took between recording indexer performance and 
updating filters. For example, if the constant was 0.5 seconds, then the data 
structures would have been updated at 13.5 seconds to incorporate the first 
observation in Table 1, at 17.5 seconds to incorporate the second observation, 
etc. The timing of data structure updates affects the values placed in the low 
pass filter and the timed low pass filter. 

• Initial values for the data structures. We set the initial values of all data 
structures to the mean of the data and in the case of the low pass filter and the 
timed low pass filter, set the time of the initial filter updates to zero. The mean 
of the data is unknown at time zero, but since the methods decay old 
information quickly, the initial value has little effect on any but the first few 
predictions. 

• Initial prediction values. We used the mean of the data for the initial prediction 
for all methods except the timed low pass filter predictions. Timed low pass 
filters compute an initial prediction based on the time of the prediction, the value 
in the timed low pass filter, the value of Q (for which we used the running 
average) and the value of m (which we set to 0.95 in this example). The mean 
of the data is unknown at time zero, but all of our prediction methods decay old 
information quickly enough so that the initial value has little effect on any but 
the first few predictions. 



Fig. 1. Data structure updates, predictions and observations as they occur in time 
(x axis: elapsed time in seconds) 



Figure 1 shows the observations in Table 1, updates of prediction data structures 
and timed low pass filter (tlpf) predictions as they occur in time. Recall that we used 
the running average for Q in the tlpf methods (Equation 3), and set m to 0.95. The 
single last observation, the running average, the low pass filter and the tlpf are all 
initialized at time zero to the mean of the observed data, 5.9 seconds. All four of 
these are updated at the time of data structure updates (see above), or just after the 
observations are recorded. (In this example, we assumed filter updates occurred 0.5 
seconds after observations were recorded.) For example, soon after the observation 
occurring at time 28, the single last observation was updated to the observed value of 
11 seconds, the running average was updated to 6.6 seconds, the low pass filter was 
updated to 6.1 seconds and the timed low pass filter was updated to 7.2 seconds. 

The single last observation, the running average and the low pass filter methods all
predict the next observation based on whatever value is in them at prediction time.
For example, in Figure 1, at time 35 the single last observation predicts a response
time of 11, the running average predicts a response time of 6.6, and the low pass filter
predicts a response time of 6.1. These predictions will stay the same until these data
structures are updated with the next observation; in our example, the prediction of these
methods is the same at time 30 as it is at time 37.
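
The behavior of these three simpler predictors can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the experiment: the class names are hypothetical, and the low pass filter update `m * old + (1 - m) * observation` is an assumed exponential-smoothing form, since the defining equation is given earlier in the paper.

```python
class SingleLastObservation:
    """Predicts that the next observation equals the most recent one."""

    def __init__(self, initial):
        self.value = initial

    def update(self, observation):
        self.value = observation

    def predict(self):
        return self.value


class RunningAverage:
    """Predicts the mean of all observations seen so far."""

    def __init__(self, initial):
        self.initial = initial  # returned only until the first update (assumption)
        self.total = 0.0
        self.count = 0

    def update(self, observation):
        self.total += observation
        self.count += 1

    def predict(self):
        return self.total / self.count if self.count else self.initial


class LowPassFilter:
    """Exponentially smoothed value; m close to 1 means old data decays slowly."""

    def __init__(self, initial, m=0.95):
        self.value = initial
        self.m = m

    def update(self, observation):
        # Assumed update form: keep a fraction m of the old value.
        self.value = self.m * self.value + (1 - self.m) * observation

    def predict(self):
        return self.value
```

Under this assumed update, a filter holding 5.8 that receives an observation of 11 with m = 0.95 moves to 6.06, which is consistent with the value of 6.1 quoted for the low pass filter after the observation at time 28.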

In the case of the timed low pass filter predictions, the time of prediction is a factor
in the predicted value, and we approximated the prediction time according to
Equation 4. For example, in Figure 1, at time 39, there is an observed value of 5, and
we have set the prediction time constant x in Equation 4 to 0.7 seconds. So the tlpf
predictions for the observation occurring at elapsed time 39 are prepared at time
(39 - 5 - 0.7), or at elapsed time 33.3.
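
The arithmetic of this example can be written as a one-line helper. The function and parameter names below are hypothetical, and the sketch assumes that the approximation in Equation 4 reduces to the subtraction used in the worked example above.

```python
def approximate_prediction_time(observation_time, response_time, overhead):
    """Elapsed time at which the prediction is assumed to have been prepared.

    observation_time: elapsed time at which the response was observed
    response_time:    observed response time of the indexer
    overhead:         prediction time constant (x in the example above)
    """
    return observation_time - response_time - overhead


# Worked example from the text: observation at time 39, value 5, x = 0.7.
example = approximate_prediction_time(39, 5, 0.7)
```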




Note that the running average, low pass filter and timed low pass filter value all
smooth the data over time: the highest and lowest observed values are mitigated by
these methods.

Since it’s hard to tell from Figure 1 which predictions are for which observed data
points, we aligned the predictions from all methods with the observed values they
were attempting to predict in Figure 2. For example, the single last observation
method predicted a value of 7.0 for the observation that occurred at time 28; the
running average method predicted 5.5, the low pass filter predicted 5.8, tlpf method A
predicted 5.4 and tlpf method B predicted 5.0.



3.3 Tuning the Timed Low Pass Filter 

In the example in section 3.2, we used a value of 0.95 for m in the timed low pass filter
updates (Equation 2) and predictions (Equation 3). We would like to choose a value
of m that makes the tlpf predictions as accurate as possible. We chose a constant for
m empirically for our experiment, rather than by solving Equation 2 and Equation 3
for m using data from [7].

Table 2. MSE for tlpf prediction methods A and B for different m values for section
3.2 example

              MSE for tlpf prediction
m value       method A      method B
0             11.78         7.63
0.1           11.79         7.62
0.2           11.82         7.61
0.3           11.86         7.57
0.4           11.87         7.51
0.5           11.80         7.46
0.6           11.58         7.48
0.7           11.08         7.69
0.8           10.11         8.29
0.9           8.65          9.08
0.99          7.53          6.58
0.999         7.61          5.93
0.9999        7.63          5.87
0.99999       7.63          5.86
0.999999      7.63          5.86
1             7.63          5.86



First we predicted indexer behavior with various m values in the tlpf equations, and
then we evaluated the accuracy of the generated predictions associated with each m
value by comparing predictions to observed data using mean square error (MSE). In
the case of the example in section 3.2, the m values and their corresponding MSE for
tlpf prediction methods A and B are shown in Table 2.

We can see from Table 2 that in our example, m value 0.99 gives the lowest MSE
for tlpf prediction method A, while m values 0.99999, 0.999999 and 1.0 all give the
lowest MSE of 5.86 for tlpf prediction method B. So the m value that minimizes the
MSE for method A is not necessarily the m value that minimizes MSE for method B.

In our example, we predicted behavior for one indexer. But QMs send queries to
many indexers, so m values must be chosen for each indexer to which the QM might
send a query. In our experiment, we computed the MSE for each m value shown in
Table 2 for each indexer contacted by the QM and picked the m value with the
smallest MSE to use when comparing tlpf methods with other predictive methods.
When we combined all indexer predictions at a QM for the tlpf methods, we chose the
best m value for each indexer: there is not one single m value for a QM, but a separate
m value for each indexer contacted by the QM.

It is important to note that our method of choosing m values allows for no cross 
training: m values are optimized on the same data that we used to make and evaluate 
predictions. 
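
The empirical choice of m amounts to a small search over a fixed list of candidate values. The following Python sketch illustrates that selection step using the MSE values reported in Table 2; the names `TABLE_2` and `best_m_values` are hypothetical and the code is an illustration, not the procedure's original implementation.

```python
# m value -> (MSE for method A, MSE for method B), as reported in Table 2.
TABLE_2 = {
    0.0: (11.78, 7.63),
    0.1: (11.79, 7.62),
    0.2: (11.82, 7.61),
    0.3: (11.86, 7.57),
    0.4: (11.87, 7.51),
    0.5: (11.80, 7.46),
    0.6: (11.58, 7.48),
    0.7: (11.08, 7.69),
    0.8: (10.11, 8.29),
    0.9: (8.65, 9.08),
    0.99: (7.53, 6.58),
    0.999: (7.61, 5.93),
    0.9999: (7.63, 5.87),
    0.99999: (7.63, 5.86),
    0.999999: (7.63, 5.86),
    1.0: (7.63, 5.86),
}

METHOD_A, METHOD_B = 0, 1


def best_m_values(table, method):
    """Return (lowest MSE, list of all m values that achieve it)."""
    lowest = min(row[method] for row in table.values())
    winners = [m for m, row in table.items() if row[method] == lowest]
    return lowest, winners


mse_a, m_a = best_m_values(TABLE_2, METHOD_A)  # a single best m for method A
mse_b, m_b = best_m_values(TABLE_2, METHOD_B)  # several tied m values for method B
```

Because the same observations are used both to pick m and to score the predictions, the selected value is a best case for the tlpf methods rather than an out-of-sample estimate.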



4 Predictions of Indexer Availability 

We’ve said before that QMs could reduce user wait time if they selected reliable 
indexers when choosing among overlapped or replicated indexers for query 
distribution. We also stated that one predictive question QMs need to address is: 

Availability: will the indexer respond to this QM before the search timeout? 

Availability data was recorded as a binary measurement: either the indexer 
responded before the timeout (value 1) or it did not (value 0). When we applied our 
predictive algorithms to this data, they produced a number between 0 and 1, which we
then rounded in order to get a predictive value. For example, an availability 
prediction of 0.3 meant we predicted the indexer would not respond before the search 
timeout; a prediction of 0.7 meant we predicted the indexer would respond before the 
timeout. 
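
The rounding step can be sketched as a simple threshold. The function name is hypothetical, and sending a score of exactly 0.5 to "available" is an assumption of this sketch; the text does not say how ties were handled.

```python
def predict_available(score, threshold=0.5):
    """Map a predictor output in [0, 1] to a binary availability prediction.

    Returns 1 (indexer expected to respond before the timeout) or 0.
    Scores equal to the threshold are treated as 1, which is an assumption.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("availability score must lie between 0 and 1")
    return 1 if score >= threshold else 0
```

An explicit threshold is used here instead of Python's built-in `round`, which rounds halves to the nearest even integer and would therefore map 0.5 to 0.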

We predicted indexer availability using all methods delineated in section 3: 

1. single last observation 

2. running average 

3. low pass filter (with m = 0.95, per [20])

4. timed low pass filter predictive method A

5. timed low pass filter predictive method B.

As in the example in section 3.2, for timed low pass filter methods A and B we used
the running average for Q in Equation 3 and we initialized data structures at elapsed 
time zero to the mean of the observations. We approximated the overhead time for 
the QM to compute predictions (x in Equation 4) as 2 seconds, and we assumed the 
data structure updates occurred in the same moment (within one second) that the 
indexer performance data was recorded in the logs. 

In addition to the five methods described in section 3, we also made predictions
using:

6. the timed low pass filter value itself. Since we had this information available,
we used it as a predictive method, primarily to compare it with the low pass
filter predictions and the timed low pass filter predictive method results.

7. the mean of all the data. By “all the data”, we mean all the observations we 
recorded. Since the mean of all the data could not be known at prediction time, 
we viewed this method as a sort of control or point of reference for our results. 

In the remainder of this section, we examine our choice of m for tlpf methods (4, 5
and 6 above). We then analyze the accuracy of the availability predictions for all
seven methods, determining which methods are superior. Last, we compare the
prediction accuracy to the consistency of the availability data itself.



4.1 Tuning the Timed Low Pass Filter 

As noted in section 3.3, for the timed low pass filter algorithms, each QM must 
choose an m value for each indexer to which queries could potentially be distributed. 
We chose m values empirically: for each QM in our study, we ran all three of the tlpf
algorithms (methods 4, 5 and 6 above) with 16 particular values for m for each of the
indexers the QM contacted. 

Table 3. MSE of tlpf method B availability predictions for CS-TR QM by m value

A = cs-tr.cs.cornell.edu:80
B = cs.nyu.edu:80
C = ncstrl.cc.vt.edu:8080
D = ncstrl.cc.vt.edu:8081
E = www.cc.gatech.edu:81

m value     A       B       C       D       E
0           0.01    0.40    0.12    0.40    0.31
0.1         0.01    0.40    0.12    0.40    0.31
0.2         0.01    0.40    0.12    0.40    0.31
0.3         0.01    0.40    0.12    0.40    0.31
0.4         0.01    0.40    0.12    0.40    0.31
0.5         0.01    0.39    0.12    0.40    0.31
0.6         0.01    0.39    0.12    0.40    0.31
0.7         0.01    0.39    0.12    0.40    0.31
0.8         0.01    0.39    0.11    0.39    0.31
0.9         0.01    0.39    0.11    0.40    0.30
0.99        0.01    0.39    0.12    0.39    0.25
0.999       0.02    0.41    0.15    0.40    0.14
0.9999      0.01    0.56    0.12    0.46    0.17
0.99999     0.99    0.61    0.88    0.56    0.32
0.999999    0.99    0.61    0.88    0.39    0.32
1           0.99    0.61    0.88    0.39    0.68







Table 3 shows the MSE for tlpf method B availability predictions for different m
values for five indexers contacted by the CS-TR QM. The MSE for the 16 different m
values for these five indexers illustrate some of the patterns we saw when choosing
“best” m values for a given indexer (for a particular tlpf algorithm for a particular
QM). For example, the MSE distribution of indexer E (www.cc.gatech.edu:81) is
unimodal, with the lowest MSE of 0.14 occurring at m value 0.999, while the MSEs
shown for indexers C and D are multimodal.

Given information like that in Table 3, we chose the “best” m value - the value, or
one of the values, producing the lowest MSE - for each tlpf method for each indexer
contacted by a QM. We used that “best” m value when comparing tlpf predictive
methods against the other predictive methods for each indexer contacted by each QM.

When we looked at the MSE for a tlpf method for “all indexers” contacted by a
QM, we chose the best m value for each indexer. In other words, the tlpf method for
“all indexers” does not use one single m value for the QM, but a separate m value for
each indexer contacted by the QM in this combined data.
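
Keeping a separate m value per indexer can be pictured as a small lookup table built from per-indexer MSE results. The sketch below uses the Table 3 columns for two of the five indexers; the names `M_VALUES`, `MSE_BY_INDEXER` and `best_m_per_indexer` are hypothetical, and breaking ties by taking the first candidate is an assumption (the text only says that one of the lowest-MSE values was chosen).

```python
# The 16 candidate m values, in the order used in Table 3.
M_VALUES = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,
            0.99, 0.999, 0.9999, 0.99999, 0.999999, 1]

# tlpf method B availability MSE per candidate m, copied from Table 3.
MSE_BY_INDEXER = {
    "cs-tr.cs.cornell.edu:80": [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01,
                                0.01, 0.01, 0.01, 0.02, 0.01, 0.99, 0.99, 0.99],
    "www.cc.gatech.edu:81": [0.31, 0.31, 0.31, 0.31, 0.31, 0.31, 0.31, 0.31,
                             0.31, 0.30, 0.25, 0.14, 0.17, 0.32, 0.32, 0.68],
}


def best_m_per_indexer(mse_by_indexer, m_values):
    """Map each indexer to (best m, lowest MSE); ties go to the first candidate."""
    result = {}
    for indexer, mses in mse_by_indexer.items():
        lowest = min(mses)
        result[indexer] = (m_values[mses.index(lowest)], lowest)
    return result


best = best_m_per_indexer(MSE_BY_INDEXER, M_VALUES)
```

A QM following this scheme would hold one such entry for every indexer it contacts, and for each tlpf method separately.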



Table 4. Availability prediction MSE for QM CS-TR

indexer                     no. obs   mean of   single    running   low pass  tlpf      tlpf      tlpf
                                      all data  last obs  average   filter    value     method A  method B
cs-tr.cs.cornell.edu:80     14,868    0.01      0.01      0.01      0.01      0.01      0.01      0.01
cs.nyu.edu:80               1,869     0.39      0.37      0.40      0.37      0.36      0.37      0.39
dri.cornell.edu:80          14,172    0.00      0.01      0.00      0.01      0.01      0.00      0.00
ei.cs.vt.edu:8090           13,653    0.01      0.02      0.01      0.02      0.01      0.01      0.01
lite.ncstrl.org:3803        10,370    0.47      0.29      0.47      0.29      0.29      0.29      0.40
ncstrl.cc.vt.edu:8080       16,957    0.12      0.12      0.12      0.12      0.12      0.12      0.11
ncstrl.cc.vt.edu:8081       1,115     0.39      0.39      0.40      0.39      0.39      0.40      0.39
ncstrl.cc.vt.edu:8090       15,824    0.05      0.07      0.05      0.07      0.05      0.05      0.05
ncstrl.cs.cornell.edu:8090  13,057    0.13      0.17      0.13      0.17      0.17      0.13      0.12
www.cc.gatech.edu:81        3,076     0.32      0.03      0.32      0.03      0.03      0.03      0.14
www.cs.dartmouth.edu:80     2,510     0.13      0.15      0.13      0.15      0.13      0.13      0.13
www.cs.uiuc.edu:80          2,785     0.02      0.02      0.02      0.02      0.02      0.02      0.02
www.cs.umass.edu:80         13,993    0.11      0.15      0.11      0.15      0.15      0.11      0.11
www.cs.umd.edu:80           2,322     0.38      0.35      0.38      0.35      0.35      0.35      0.37
www.cs.utah.edu:80          910       0.06      0.06      0.07      0.06      0.06      0.07      0.06
www.icase.edu:80            13,761    0.09      0.14      0.09      0.14      0.14      0.09      0.09
www.ics.forth.gr:7000       5,162     0.11      0.06      0.11      0.06      0.06      0.06      0.10
www.tc.cornell.edu:80       13,988    0.01      0.00      0.01      0.00      0.00      0.00      0.01
all indexers                160,392   0.10      0.10      0.10      0.10      0.09      0.08      0.09



4.2 Availability Prediction Results 

Predictions are evaluated by comparing them to observations using mean square error 
(MSE). Since both the availability data and the availability predictions had binary 
values of 0 and 1, the error for any observation could be 0, 1, or -1. This means that 
the MSE is the same as the mean of the absolute value of the error: an MSE of 0.10
implies that one out of ten predictions was incorrect.
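
The equivalence between MSE and error rate for binary data is easy to check numerically. The following is a minimal Python sketch with hypothetical function names and made-up sample data, included only to illustrate the identity.

```python
def mean_square_error(predictions, observations):
    """Mean of squared differences between predictions and observations."""
    pairs = list(zip(predictions, observations))
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)


def error_rate(predictions, observations):
    """Fraction of predictions that differ from the observation."""
    pairs = list(zip(predictions, observations))
    return sum(1 for p, o in pairs if p != o) / len(pairs)


# Illustrative data: ten binary predictions, exactly one of them wrong.
observed = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
predicted = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]

mse = mean_square_error(predicted, observed)
rate = error_rate(predicted, observed)
```

Each error is 0, 1 or -1, so its square equals its absolute value, and both quantities come out to 0.1 for this sample.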

Table 4 shows the MSE for each of the availability predictive methods for QM
CS-TR. For all but two indexers contacted by the CS-TR QM, any predictive method has
a MSE within 0.05 of any other predictive method. For example, indexer
cs.nyu.edu:80 has MSEs ranging from a low of 0.36 for the tlpf value method to a
high of 0.40 for the running average method. The two indexers that have more
widely varying MSE values across different predictive methods are
lite.ncstrl.org:3803, with MSE values ranging from 0.29 to 0.47, and
www.cc.gatech.edu:81, with MSE values ranging from 0.03 to 0.32. Since results are
similar for all predictive methods, we surmise that any pattern, or lack thereof, in the
availability data affects all predictive methods similarly. In fact, the single last
observation method and the low pass filter method (with m = 0.95) have identical
MSE values for all indexers. Similarly, the mean of all data and the running average
MSE values are equal for all but three indexers (cs.nyu.edu:80, ncstrl.cc.vt.edu:8081,
and www.cs.utah.edu:80), for which the running average MSE is 0.01 greater than the
mean of all data MSE.

The MSE of availability predictions for the other four QMs in our study are similar
to those presented in Table 4; we present the MSE for the combined data of all polled
indexers for each of the QMs in Table 5.

Table 5. MSE for all availability predictions

QM          no. obs   mean of   single    running   low pass  tlpf      tlpf      tlpf
                      all data  last obs  average   filter    value     method A  method B
cs-tr       160,392   0.10      0.10      0.10      0.10      0.09      0.08      0.09
ncstrl      113,511   0.09      0.09      0.09      0.09      0.09      0.07      0.08
berkeley    69,483    0.13      0.12      0.13      0.12      0.13      0.11      0.11
lite        6,363     0.09      0.07      0.09      0.07      0.07      0.07      0.07
forth       743       0.14      0.13      0.16      0.13      0.13      0.12      0.13



In Table 5, the MSE ranges narrowly, from 0.07 for a variety of prediction
methods for the LITE QM to 0.16 for the running average method for the FORTH QM.
This reinforces our conclusion from Table 4: all availability predictive methods have
similar results. In Table 5, the MSE for any method for a given QM is at most 0.04
different from the MSE for any other method for the same QM. We do note that tlpf
methods A and B have a slightly lower MSE than any of the other methods, but the
single last observation method performs nearly as well and requires no optimization
of formula variables (such as m). Moreover, the single last observation method is
trivial to compute and has minimal start up costs.

Given that the vast majority of the predictions were for the CS-TR and NCSTRL
QMs, which had maximum MSE values of 0.10 and 0.09 respectively, we can say that
approximately 90% of our indexer availability predictions were accurate regardless of
the predictive method used.

As another means of comparing the different predictive methods, we now examine
Table 6, showing the highest MSE for each predictive method for an individual
indexer at each QM in our study. Note that multiple indexers contacted by each QM
may be represented in Table 6: the indexer with the highest MSE for one predictive
method may be different from the indexer with the least accurate predictions for a
different predictive method as viewed by the same QM.

Table 6. Maximum MSE for QM availability predictions for individual indexers

QM          mean of   single    running   low pass  tlpf      tlpf      tlpf
            all data  last obs  average   filter    value     method A  method B
cs-tr       0.47      0.39      0.47      0.39      0.39      0.40      0.40
ncstrl      0.44      0.37      0.45      0.37      0.37      0.37      0.45
berkeley    0.50      0.38      0.50      0.38      0.38      0.38      0.41
lite        0.49      0.28      0.50      0.28      0.29      0.29      0.31
forth       0.36      0.39      0.75      0.39      0.39      0.38      0.75



We now wish to compare availability prediction accuracy with the data 
consistency: if an indexer consistently responds (or consistently doesn’t respond) to 
the QM before the search timeout, then we would expect our predictions to be very 
accurate. Our measurement of data consistency is the availability ratio: the number of 
times an indexer responded to the QM before the search timeout divided by the 
number of times the QM attempted to contact the indexer. 

Table 7 compares the availability ratio of indexers contacted by the CS-TR QM 
and the MSE of the single last observation prediction method. We see that in general, 
as the availability ratio increases, the MSE for the single last observation decreases. 
This meets our expectations of highly accurate predictions when indexer behavior is 
highly consistent - when indexers have availability ratios of 0.98, 0.99, or 1.00, the 
MSE is 0.02 or less. 

Note that due to the nature of the availability measurements (1 if the indexer 
responded to the QM before the search timeout, 0 if the indexer didn’t respond to the 
QM in time), the mean of the availability data is the same as the availability ratio. If 
we were comparing the MSE for the mean of all data predictive method with the 
availability ratio, we would expect a perfect inverse relationship between availability 
and MSE (MSE = 1 - availability ratio) for all indexers responding more than half the 
time. Likewise, we would expect a direct relationship (MSE = availability ratio) for 
all indexers with an availability ratio of 0.50 or less. If we compare the mean of all 
data MSE values in Table 4 with the availability ratios in Table 7, we can see that this 
is true. 
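
The relationship between the availability ratio and the MSE of the mean of all data method can be sketched as follows. The function names are hypothetical, and the sketch assumes that the mean is rounded to 1 only when the ratio exceeds 0.5, matching the two cases described above.

```python
def availability_ratio(responses):
    """Fraction of contact attempts answered before the timeout (1 = answered)."""
    return sum(responses) / len(responses)


def mean_predictor_mse(ratio):
    """Expected MSE when every prediction is the rounded mean of the data.

    If the indexer responds more than half the time the constant prediction
    is 1 and is wrong whenever the indexer fails to respond (1 - ratio);
    otherwise the constant prediction is 0 and is wrong whenever it responds.
    """
    return 1 - ratio if ratio > 0.5 else ratio


# Synthetic check: 61 responses out of 100 attempts, constant prediction of 1.
sample = [1] * 61 + [0] * 39
ratio = availability_ratio(sample)
brute_force = sum((1 - r) ** 2 for r in sample) / len(sample)
```

For instance, an availability ratio of 0.61 gives an expected MSE of 0.39 and a ratio of 0.39 also gives 0.39, in line with the cs.nyu.edu:80 and ncstrl.cc.vt.edu:8081 rows of Tables 4 and 7.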

In Table 7 there are four indexers for which the MSE is significantly different than
we would expect: lite.ncstrl.org:3803, www.cc.gatech.edu:81, www.ics.forth.gr:7000,
and www.icase.edu:80. The first three indexers are instances in which the single last
observation predictive method out-performs the mean of all data predictive method,
while the last instance is one in which the single last observation method does worse
than the mean of all data method.

Table 7. Single last observation availability prediction MSE compared with
availability ratios for indexers contacted by CS-TR QM (sorted by availability ratio)

indexer                     no. obs   availability  single last obs MSE
ncstrl.cc.vt.edu:8081       1,115     0.39          0.39
lite.ncstrl.org:3803        10,370    0.53          0.29
cs.nyu.edu:80               1,869     0.61          0.37
www.cs.umd.edu:80           2,322     0.62          0.35
www.cc.gatech.edu:81        3,076     0.68          0.03
ncstrl.cs.cornell.edu:8090  13,057    0.87          0.17
www.cs.dartmouth.edu:80     2,510     0.87          0.15
ncstrl.cc.vt.edu:8080       16,957    0.88          0.12
www.cs.umass.edu:80         13,993    0.89          0.15
www.ics.forth.gr:7000       5,162     0.89          0.06
www.icase.edu:80            13,761    0.91          0.14
www.cs.utah.edu:80          910       0.94          0.06
ncstrl.cc.vt.edu:8090       15,824    0.95          0.07
www.cs.uiuc.edu:80          2,785     0.98          0.02
cs-tr.cs.cornell.edu:80     14,868    0.99          0.01
ei.cs.vt.edu:8090           13,653    0.99          0.02
www.tc.cornell.edu:80       13,988    0.99          0.00
dri.cornell.edu:80          14,172    1.00          0.01
all indexers                160,392   0.89          0.10



The comparisons of single last observation MSE to availability ratios for the other
four QMs in our study are similar to those presented in Table 7. In Table 8 we
compare the availability ratios for the combined data of all polled indexers for each
QM with the MSE for the single last observation method.

Table 8. Single last observation availability prediction MSE and availability ratio of
all indexers (sorted by availability ratio)

QM          no. obs   availability  single last obs MSE
forth       743       [illegible]   0.13
berkeley    69,483    [illegible]   0.12
ncstrl      113,511   [illegible]   0.09
cs-tr       160,392   0.89          0.10
lite        6,363     0.91          0.07
all QMs     350,492   0.87          0.10

As in Table 7, in Table 8 we see that as the availability ratio increases, the MSE
decreases. In other words, as the availability becomes more consistent, the accuracy
of our predictions increases.

Table 8 also indicates that the availability ratio for all observations is 0.87: QMs 
received a response from a polled indexer before the search timeout 87% of the time. 
We could also say that the consistency of the data is high. (The consistency would be 
equally high if the availability ratio for all observations was 0.13). This high 
consistency of the data is the largest factor for the overall accuracy of 90% for our 
single last observation predictions for all indexers polled at all QMs. 



5 Predictions of Indexer Response Time 

Our goal is to predict QM-view indexer behavior as accurately as possible so the QM
can choose fast, reliable indexers for query distribution when presented with 
overlapped or replicated indexers. Recall that QM-view indexer predictions not only 
need to address the availability question (see section 4), but also the following 
question: 



Response time: how quickly will the indexer respond to the QM, given that it 
responds before the search timeout? 

We made QM-view indexer response time predictions with the same seven
predictive methods used to make availability predictions. As with the availability
predictions, for timed low pass filter methods A and B we used the running average for 
Q in Equation 3 and we initialized data structures at elapsed time zero to the mean of 
the observations. We approximated the overhead time for the QM to compute 
predictions (x in Equation 4) as 2 seconds, and we assumed the data structure updates 
occurred in the same moment (within one second) that the indexer performance data 
was recorded in the logs. 

In this section, we explain how we chose m for tlpf methods (method A, method B,
and filter value). We then analyze the accuracy of the response time predictions for all
seven methods, determining which methods are most accurate. Last, we compare the 
prediction accuracy to the consistency of the response time data itself. 



5.1 Tuning the Timed Low Pass Filter 

As demonstrated in section 4.1 for the availability predictions, the m values for the
timed low pass filter algorithms for response time predictions did not adhere to one
pattern. Therefore, as with the availability predictions, we chose the “best” m value
out of the 16 we tried for each indexer contacted by each QM for each tlpf method for
response time predictions. We used that “best” m value when comparing tlpf
predictive methods against the other predictive methods for each indexer at each QM.

When we looked at the MSE for a tlpf method for “all indexers” contacted by a
QM, we chose the best m value for each indexer. In other words, the tlpf method for
“all indexers” does not use one single m value for the QM, but a separate m value for
each indexer contacted by the QM in this combined data.



5.2 Response Time Prediction Results

We evaluated our predictions by comparing them to observations using mean square
error (MSE).

Table 9 shows the MSE for each of the response time predictive methods for QM
CS-TR, and in the rightmost column the difference between the highest MSE and the
lowest MSE for any method for each indexer.

Table 9. Response time prediction MSE for QM CS-TR

indexer                     no. obs     mean of     single      running     low pass    tlpf        tlpf        tlpf        range of
                                        all data    last obs    average     filter      value       method A    method B    MSEs
cs-tr.cs.cornell.edu:80     14,748      1.7         1.3         1.7         1.5         1.3         1.4         1.6         0.4
cs.nyu.edu:80               1,135       12.6        21.9        12.5        20.5        12.6        12.5        12.5        9.5
dri.cornell.edu:80          14,119      1.9         3.4         2.2         3.3         1.9         2.2         1.9         1.5
ei.cs.vt.edu:8090           13,462      2.3         1.9         2.5         1.9         1.9         1.9         2.3         0.6
lite.ncstrl.org:3803        5,457       10.8        14.6        11.4        13.7        10.8        11.4        10.8        3.8
ncstrl.cc.vt.edu:8080       14,925      15.6        24.3        16.6        23.2        15.6        16.6        15.6        8.8
ncstrl.cc.vt.edu:8081       [illegible] 51.3        65.7        51.9        63.1        51.3        51.9        45.2        20.5
ncstrl.cc.vt.edu:8090       [illegible] 6.4         9.9         6.9         9.3         6.4         6.9         6.4         3.6
ncstrl.cs.cornell.edu:8090  11,397      9.7         7.9         10.1        7.4         7.1         7.7         8.3         [illegible]
www.cc.gatech.edu:81        [illegible] 5.8         8.6         5.8         8.1         5.8         5.8         5.7         2.9
www.cs.dartmouth.edu:80     2,185       6.4         9.4         6.6         8.8         6.4         6.6         6.4         [illegible]
www.cs.uiuc.edu:80          2,735       1.0         1.7         1.9         1.8         1.0         1.7         1.0         0.9
www.cs.umass.edu:80         12,474      [illegible] 13.8        9.7         12.9        [illegible] 9.7         9.5         4.3
www.cs.umd.edu:80           1,433       8.8         8.1         8.8         7.6         7.4         7.7         7.9         1.4
www.cs.utah.edu:80          852         1.3         2.6         1.3         2.2         1.3         1.3         1.3         1.3
www.icase.edu:80            [illegible] 4.5         8.2         5.4         7.8         4.5         5.4         4.5         3.7
www.ics.forth.gr:7000       [illegible] 7.3         [illegible] 7.1         9.4         7.3         7.1         [illegible] 3.0
www.tc.cornell.edu:80       13,893      1.4         1.3         1.7         1.4         1.3         1.3         1.4         0.4
all indexers                143,427     6.2         8.5         6.6         8.1         5.9         6.3         6.0         2.6



Table 9 has a minimum MSE value of 1.0 for the mean of all data, tlpf value and
tlpf method B methods for indexer www.cs.uiuc.edu:80 and a maximum MSE value of
65.7 for the single last observation method for indexer ncstrl.cc.vt.edu:8081.
Comparing different methods for a single indexer, the range of the MSE values can be
as narrow as 0.4 (cs-tr.cs.cornell.edu:80 and www.tc.cornell.edu:80) and as wide as
20.5 (ncstrl.cc.vt.edu:8081). So unlike the availability predictions, the response time
predictions are not similarly accurate for all predictive methods.

Recall that for the availability predictions, the single last observation method gave
the same MSE as the low pass filter method (with m = 0.95). For the response time
predictions in Table 9, the single last observation and the low pass filter methods have
similar MSE values, but they aren’t exactly the same. The running average and mean
of all data MSEs resemble each other, but recall that the mean of all data was
computed with the same data we are predicting against (i.e. there was no cross
training), while the running average method doesn’t require foreknowledge. When
comparing the single last observation and the running average methods, 12 of 18
responding indexers in Table 9 had a lower MSE with the running average method.

For 13 out of the 18 indexers responding to QM CS-TR, tlpf method B had the
lowest MSE value, either singly or tied with another predictive method’s MSE. The
remaining 5 indexers maximized prediction accuracy with the tlpf value method, and
sometimes with the single last observation method as well. In fact, when tlpf method
B was not the most accurate method, then the single last observation method
outperformed the running average method; in cases where the running average did
better than single last observation, tlpf method B was the most accurate.

Since m values of tlpf methods are not cross-trained, it is unsurprising that one of
the tlpf methods can match or beat the methods not benefiting from prescience. What
is interesting is that for QM CS-TR, the running average method is never more than
1.0 away from the most accurate method, except in the cases of indexer
ncstrl.cc.vt.edu:8081, where tlpf method B is 6.1 lower than the next most accurate
method, and indexer ncstrl.cs.cornell.edu:8090, where the single last observation
method is 7.9, the tlpf value method is 7.1 and the running average is 10.1. So for
nearly all indexers responding to QM CS-TR, the running average is within 1.0 of the
best method even when the range of the MSE values is as high as 8.8 (indexer
ncstrl.cc.vt.edu:8080) or 9.5 (cs.nyu.edu:80), but requires no prescience or
complicated calculations of prediction algorithm variables.

The MSE of response time predictions for the four other QMs in our study are 
similar to those presented in Table 9; we present the MSE for the combined data for 
all responding indexers for each of the QMs in our study in Table 10. 

Table 10. MSE for all response time predictions

QM          no. obs   mean of   single    running   low pass  tlpf      tlpf      tlpf      range of
                      all data  last obs  average   filter    value     method A  method B  MSEs
cs-tr       143,427   6.2       8.5       6.6       8.1       5.9       6.3       6.0       2.6
ncstrl      99,406    7.8       10.2      7.7       9.7       7.6       7.5       7.5       2.8
berkeley    54,852    15.2      24.7      15.2      23.4      15.1      15.2      15.0      9.6
lite        5,811     9.9       14.8      9.9       14.0      9.9       9.8       9.2       5.6
forth       479       67.1      118.8     69.4      113.2     67.1      69.2      64.0      54.8



Table 10 shows that the CS-TR QM not only has the most accurate predictions in
our study on average, but that the different prediction methods are most consistent for
the CS-TR QM - the MSE values only range from 5.9 to 8.5 for this QM. The
FORTH QM has the least accurate response time predictions in our study, on average,
and their accuracy varies widely depending on the prediction method, from a MSE of
64.0 for tlpf method B to 118.8 for the single last observation method.

The most accurate method for predicting response times for the CS-TR QM, when
examining the MSE for all responding indexers combined, is the tlpf value method.
Tlpf method B performs best for the other four QMs (though tlpf method A performs
as well as method B for the NCSTRL QM), and for the CS-TR QM, the MSE for
tlpf method B is only 0.1 higher than that for the tlpf value prediction method.
However, the running average method, which is simpler to compute and uses no
foreknowledge, performs nearly as well. For four of the QMs (all but FORTH), the
minimum MSE value is within 0.7 of the MSE value for the running average. For
two of these QMs (NCSTRL and BERKELEY), the running average MSE is within 0.2
of the most accurate method. Even for the FORTH QM, the MSE for the running
average method, 69.4, is much closer to the lowest MSE value of 64.0 than it is to the
highest MSE value of 118.8. Moreover, since the FORTH QM has far fewer
observations than the other QMs, we can still assert that for all response time
predictions in our study, the running average method performs nearly as well as any
of the tlpf methods or the mean of all data method.
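
The comparison between the running average and the most accurate method can be reproduced directly from the Table 10 figures. The sketch below is illustrative only; `TABLE_10`, `RUNNING_AVERAGE` and `running_average_gap` are hypothetical names, and results are rounded to one decimal place because the table itself is reported at that precision.

```python
# QM -> MSE per method, in Table 10 column order:
# mean of all data, single last obs, running average, low pass filter,
# tlpf value, tlpf method A, tlpf method B.
TABLE_10 = {
    "cs-tr": [6.2, 8.5, 6.6, 8.1, 5.9, 6.3, 6.0],
    "ncstrl": [7.8, 10.2, 7.7, 9.7, 7.6, 7.5, 7.5],
    "berkeley": [15.2, 24.7, 15.2, 23.4, 15.1, 15.2, 15.0],
    "lite": [9.9, 14.8, 9.9, 14.0, 9.9, 9.8, 9.2],
    "forth": [67.1, 118.8, 69.4, 113.2, 67.1, 69.2, 64.0],
}

RUNNING_AVERAGE = 2  # column index of the running average method


def running_average_gap(mses):
    """Difference between the running average MSE and the lowest MSE in the row."""
    return round(mses[RUNNING_AVERAGE] - min(mses), 1)


gaps = {qm: running_average_gap(mses) for qm, mses in TABLE_10.items()}
```

With these figures the gap is 0.7 for CS-TR and LITE, 0.2 for NCSTRL and BERKELEY, and 5.4 for FORTH.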

If we take the root mean square (RMS), or the square root of the MSE, then we get
a rough approximation of the average error for a prediction. For the CS-TR QM, the 
running average predictions on average are within SQRT (6.6) = 2.6 seconds of the 
response time observations. For the BERKELEY QM, the running average predictions 
are generally within SQRT (15.2) = 3.9 seconds of the response time observations. 
For the FORTH QM, the running average predictions on average are within SQRT 
(69.4) = 8.4 seconds of the response time observations. 
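
The conversion from MSE to an approximate per-prediction error is a single square root. The following minimal Python sketch applies it to the running average MSE values from Table 10; the function name is hypothetical.

```python
import math


def rms_error(mse):
    """Square root of the mean square error, in the units of the observations."""
    if mse < 0:
        raise ValueError("MSE cannot be negative")
    return math.sqrt(mse)


# Running average MSE values from Table 10 (seconds squared).
rms_cs_tr = rms_error(6.6)
rms_berkeley = rms_error(15.2)
rms_forth = rms_error(69.4)
```

Because the tables report MSE values rounded to one decimal place, figures recomputed this way can differ from those quoted in the text in the last digit.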

In order to further compare predictive methods and examine prediction accuracy,
we now present Table 11, which shows the highest MSE for each predictive method
for an individual indexer at each QM. Note that multiple indexers contacted by each
QM may be represented in Table 11: the indexer with the highest MSE for one
predictive method may be different from the indexer with the least accurate
predictions for a different predictive method as viewed by the same QM.

Table 11. Maximum MSE for QM response time predictions for individual indexers

                                                   timed low pass filter
           mean of   single    running  low pass   tlpf
QM         all data  last obs  average  filter     value   method A  method B

cs-tr        51.3      65.7     51.9      63.1     51.3      51.9      45.2
ncstrl       21.0      30.3     21.1      28.6     21.0      21.1      21.0
berkeley     38.5      57.4     38.0      54.4     38.5      37.9      37.8
lite         23.0      50.2     23.7      47.7     23.0      23.7      23.0
forth        95.2     169.1     97.1     161.1     95.2      96.9      88.5



In Table 11, tlpf method B has the lowest maximum MSE values for individual
indexer predictions at all QMs, though the tlpf value and mean of all data methods
equal the tlpf method B MSE value at the NCSTRL and LITE QMs. For three QMs,
the worst MSE value for an individual indexer using the running average prediction
method is very close to the worst MSE value using tlpf method B: for NCSTRL, the
running average MSE is only 0.1 greater than the tlpf method B MSE, for BERKELEY
it’s within 0.2 and for LITE it’s within 0.7. For the CS-TR QM, the maximum
individual indexer MSE for tlpf method B is 45.2, while the least accurate predictions
using the running average method give an MSE of 51.9. This is a discrepancy of 6.7,
while the discrepancy for the FORTH QM is 8.6. While these discrepancies seem
large, we need to remember that this is a measurement looking at the average of the
squared prediction error.
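To see why a gap of several MSE units is less dramatic than it looks, the sketch below takes the running average and timed low pass filter method B columns of Table 11 and expresses the gap both in MSE units and in seconds (as a difference of square roots). The variable names are assumptions made for the illustration.

```python
import math

# Maximum per-indexer MSE from Table 11, by QM.
running_average = {"cs-tr": 51.9, "ncstrl": 21.1, "berkeley": 38.0,
                   "lite": 23.7, "forth": 97.1}
tlpf_method_b = {"cs-tr": 45.2, "ncstrl": 21.0, "berkeley": 37.8,
                 "lite": 23.0, "forth": 88.5}

# Gap in squared-error units, as quoted in the text.
mse_gap = {qm: round(running_average[qm] - tlpf_method_b[qm], 1)
           for qm in running_average}

# The same gap expressed as a difference between RMS values, in seconds.
rms_gap = {qm: math.sqrt(running_average[qm]) - math.sqrt(tlpf_method_b[qm])
           for qm in running_average}
```

On these figures, every RMS gap stays below half a second, including those for the CS-TR and FORTH QMs.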

If we examine the RMS for these worst case running average predictions for
individual indexers represented in Table 11, then the running average method is
within SQRT (97.1) = 9.9 seconds of response times, on average, for the least
accurate individual indexer predictions. If we ignore the FORTH data, then that
calculation drops to SQRT (51.9) = 7.2 seconds: running average predictions are
within roughly 7 seconds of response times, on average, for the least accurate
individual indexer predictions at the CS-TR QM. (Recall from Table 10 that overall,
CS-TR QM predictions are within 2.6 seconds of the observed response times.)

We will now compare response time prediction accuracy with data consistency;
our measurement of data consistency is the variance of the observed response times.

Table 12 compares the mean response times and variances for indexers responding
to the CS-TR QM with the MSE and RMS of the running average prediction
method. We see that in general, as the variance for an indexer’s response times
increases, the MSE for the running average prediction method increases as well.
There are exceptions to this: www.cs.utah.edu:80 has a variance of 1.3 and an MSE of
1.3, while www.cs.uiuc.edu:80 has a variance of 1.0 and an MSE of 1.9; two other
exceptions occur for indexers dri.cornell.edu:80 and www.cs.dartmouth.edu:80.

Table 12. Running average response time prediction MSE and RMS compared with
mean response times and variances for indexers responding to the CS-TR QM (sorted
by variance)

                                       mean                running      running  RMS
indexer                      no. obs   resptime  variance  avg MSE      avg RMS  / mean

cs-tr.cs.cornell.edu:80       14,748     1.9       0.8       1.7          1.3      68%
www.tc.cornell.edu:80         13,893     1.7       1.0       1.7          1.3      76%
www.cs.uiuc.edu:80             2,735     3.0       1.0       1.9          1.4      46%
www.cs.utah.edu:80               852     0.3       1.3       1.3          1.2     384%
ei.cs.vt.edu:8090             13,462     1.8       1.6       2.5          1.6      87%
dri.cornell.edu:80            14,119     3.2       1.9       2.2          1.5      47%
www.icase.edu:80              12,473     2.0       4.5       5.4          2.3     116%
www.cc.gatech.edu:81           2,097     2.7       5.3       5.8          2.4      89%
ncstrl.cc.vt.edu:8090         15,004     4.2       6.3       6.9          2.6      63%
www.cs.dartmouth.edu:80        2,185     7.0       6.4       6.6          2.6      37%
www.ics.forth.gr:7000          4,598     5.5       7.1       7.1          2.7      48%
www.cs.umd.edu:80              1,433    10.9       8.0       [illegible]  3.0      27%
www.cs.umass.edu:80           12,474     4.7       9.4       9.7          3.1      66%
ncstrl.cs.cornell.edu:8090    11,397     4.4       9.5      10.1          3.2      72%
lite.ncstrl.org:3803           5,457     7.2      10.7      11.4          3.4      47%
cs.nyu.edu:80                  1,135     2.8      11.9      12.5          3.5     126%
ncstrl.cc.vt.edu:8080         14,925     5.1      15.6      16.6          4.1      80%
ncstrl.cc.vt.edu:8081            440    13.3      51.3      51.9          7.2      54%

all indexers                 143,427     3.6       6.3       6.6          2.6      71%
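The trend in Table 12, and its exceptions, can be checked mechanically. The sketch below lists each indexer's variance and running average MSE in table order and flags any indexer whose MSE is lower than that of the indexer just above it, which is one possible reading of "exception". The row for www.cs.umd.edu:80 is left out because its MSE value is not legible; the host names are written as best recovered from the table.

```python
# (indexer, variance, running average MSE), in Table 12 order (sorted by variance).
rows = [
    ("cs-tr.cs.cornell.edu:80", 0.8, 1.7),
    ("www.tc.cornell.edu:80", 1.0, 1.7),
    ("www.cs.uiuc.edu:80", 1.0, 1.9),
    ("www.cs.utah.edu:80", 1.3, 1.3),
    ("ei.cs.vt.edu:8090", 1.6, 2.5),
    ("dri.cornell.edu:80", 1.9, 2.2),
    ("www.icase.edu:80", 4.5, 5.4),
    ("www.cc.gatech.edu:81", 5.3, 5.8),
    ("ncstrl.cc.vt.edu:8090", 6.3, 6.9),
    ("www.cs.dartmouth.edu:80", 6.4, 6.6),
    ("www.ics.forth.gr:7000", 7.1, 7.1),
    ("www.cs.umass.edu:80", 9.4, 9.7),
    ("ncstrl.cs.cornell.edu:8090", 9.5, 10.1),
    ("lite.ncstrl.org:3803", 10.7, 11.4),
    ("cs.nyu.edu:80", 11.9, 12.5),
    ("ncstrl.cc.vt.edu:8080", 15.6, 16.6),
    ("ncstrl.cc.vt.edu:8081", 51.3, 51.9),
]

# An exception: variance did not decrease, yet the MSE fell below the previous row's.
exceptions = [current[0]
              for previous, current in zip(rows, rows[1:])
              if current[2] < previous[2]]
```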



The root mean square (RMS) column in Table 12 indicates how much, on average,
a running average prediction differs from the corresponding actual response time. In
the best case for the CS-TR QM, running average predictions average a 1.2 second
discrepancy from observations (indexer www.cs.utah.edu:80), in the worst case the
discrepancy is 7.2 seconds (ncstrl.cc.vt.edu:8081), and overall, the discrepancy
averages 2.6 seconds.

When we say that running average response time predictions, on average, are
within 2.6 seconds of observed response times for the CS-TR QM, that sounds pretty
good . . . until we examine the mean response time for all indexers responding to the
QM: 3.6 seconds. So the average error of our running average predictions is 71% of
the mean of all response times. The best results are for indexer www.cs.umd.edu:80,
where the running average RMS is 3.0 seconds and the mean response time is 10.9
seconds: for this indexer, the RMS is only 27% of the mean. However, this is the
indexer with the second highest mean response time for the CS-TR QM. The indexer
with the lowest mean response time of 0.3 seconds, www.cs.utah.edu:80, has a
running average RMS of 1.2 seconds, which is 384% of the mean, the highest
percentage of any indexer.
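The RMS / mean column is a simple ratio. The sketch below recomputes it from the rounded RMS and mean values printed in Table 12; because those inputs are rounded to one decimal place, the results only approximate the published percentages, which were presumably derived from unrounded data. The function name is an assumption.

```python
def rms_to_mean_ratio(rms_seconds, mean_seconds):
    """Average prediction error expressed as a fraction of the mean response time."""
    if mean_seconds <= 0:
        raise ValueError("mean response time must be positive")
    return rms_seconds / mean_seconds


# Rounded (RMS, mean) pairs taken from Table 12.
umd_ratio = rms_to_mean_ratio(3.0, 10.9)       # www.cs.umd.edu:80
utah_ratio = rms_to_mean_ratio(1.2, 0.3)       # www.cs.utah.edu:80
all_indexers_ratio = rms_to_mean_ratio(2.6, 3.6)
```

The ratio penalises fast indexers heavily: an error of about one second is small in absolute terms but is several times a mean of 0.3 seconds.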

Perhaps the most positive information from Table 12 is that the running average
RMS is 4.1 seconds or lower for all indexers except ncstrl.cc.vt.edu:8081, which
had only 440 of the 143,427 observations recorded by QM CS-TR. These RMS figures
for predictive methods, combined with the variance or standard deviation of indexer
response times, could inform the choice of search timeout values.
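One hypothetical way to turn these figures into a timeout is to add a multiple of the RMS error to the predicted response time. The policy, the function name and the `margin_factor` default below are illustrative assumptions, not something evaluated in this study.

```python
def suggested_timeout(predicted_seconds, rms_seconds, margin_factor=2.0):
    """Hypothetical timeout: prediction plus a multiple of the typical error."""
    if predicted_seconds < 0 or rms_seconds < 0 or margin_factor < 0:
        raise ValueError("inputs must be non-negative")
    return predicted_seconds + margin_factor * rms_seconds


# Using Table 12 means as stand-ins for the current running average prediction.
cs_tr_overall_timeout = suggested_timeout(3.6, 2.6)     # all indexers
slowest_indexer_timeout = suggested_timeout(13.3, 7.2)  # ncstrl.cc.vt.edu:8081
```

A larger `margin_factor` trades longer waits for fewer prematurely abandoned indexers; choosing it would need the kind of measurement reported here.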

The comparisons of running average MSE to mean and variance of response time
data for the other four QMs in our study are similar to those presented in Table 12. In
Table 13 we compare the mean response times and variances for the combined data of
all responding indexers for each QM with the MSE for the running average method.

Table 13. Running average response time prediction MSE and RMS compared with
mean response time and variance of all responding indexers (sorted by variance)

                      mean                 running  running  RMS
QM         no. obs    resptime   variance  avg MSE  avg RMS  / mean

cs-tr      143,427     3.6         6.2       6.6      2.6     71%
ncstrl      99,406     6.6         7.8       7.7      2.8     42%
lite         5,811     3.4         9.9       9.9      3.1     92%
berkeley    54,852     6.8        15.2      15.2      3.9     57%
forth          479     [missing]  67.2      69.4      8.3     83%

all QMs    303,975     5.2         8.5       8.7      2.9     57%



In Table 13, as in Table 12, we see that as the variance increases, the MSE
increases: as the consistency of the data decreases, the accuracy of our predictions
decreases. Almost half of the response time predictions occur at QM CS-TR; nearly a
third occur at NCSTRL: these QMs have the lowest running average RMS, 2.6
seconds and 2.8 seconds, respectively. So about four fifths of the response time
predictions have a discrepancy of no more than 2.8 seconds, on average, from
observed response times. Across all QMs, response time predictions differ from the
observed response times by 2.9 seconds, on average.
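The shares and the combined figure can be recomputed from the observation counts and running average MSE values in Table 13. The sketch assumes that each QM's MSE is an average over roughly as many predictions as it has observations, so that the combined MSE is the observation-weighted mean of the per-QM values; the variable names are illustrative.

```python
import math

# Observation counts and running average MSE per QM, from Table 13.
observations = {"cs-tr": 143_427, "ncstrl": 99_406, "lite": 5_811,
                "berkeley": 54_852, "forth": 479}
running_avg_mse = {"cs-tr": 6.6, "ncstrl": 7.7, "lite": 9.9,
                   "berkeley": 15.2, "forth": 69.4}

total = sum(observations.values())
share = {qm: count / total for qm, count in observations.items()}

# Observation-weighted mean of the per-QM MSE values, and its square root.
pooled_mse = sum(observations[qm] * running_avg_mse[qm]
                 for qm in observations) / total
pooled_rms = math.sqrt(pooled_mse)
```

The resulting shares for CS-TR and NCSTRL, and the pooled MSE and RMS, can be compared directly with the "all QMs" row of Table 13.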

The mean response time for all indexers responding to all QMs in our study is 5.2
seconds; the RMS for the running average prediction method is 2.9 seconds. So the
average error of our running average predictions is 57% of the mean for
all response time data. The QM with the lowest ratio between running average RMS
and mean response time is NCSTRL; the worst ratio occurs at the LITE QM, with an
RMS of 3.1 seconds and a mean response time of 3.4 seconds.
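The ranking of QMs by this ratio can be reproduced from the rounded mean and RMS values in Table 13. The FORTH QM is left out of the sketch because its mean response time is not legible in the table; the variable names are assumptions.

```python
# (mean response time, running average RMS) in seconds, from Table 13.
table_13 = {
    "cs-tr": (3.6, 2.6),
    "ncstrl": (6.6, 2.8),
    "lite": (3.4, 3.1),
    "berkeley": (6.8, 3.9),
}

ratio = {qm: rms / mean for qm, (mean, rms) in table_13.items()}
lowest_ratio_qm = min(ratio, key=ratio.get)
highest_ratio_qm = max(ratio, key=ratio.get)
```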

The variance for all the response time data in our study is 8.5 seconds squared; the
standard deviation is SQRT(8.5) = 2.9 seconds. This is more than half of the mean
response time of 5.2 seconds for all the data in our study; therefore, on average, the
response time data is not highly consistent, or tightly clustered around the mean. This
is the largest factor in the accuracy or lack thereof in our response time predictions.



6 Conclusions

In this paper, we examined the accuracy of indexer performance predictions made 
from the query mediator perspective. Such predictions could be used to inform a 
QM’s selection of indexers for query distribution when optimizing for performance 
(choosing the fastest, most reliable indexer among a set of overlapped or replicated 
indexers). By improving the QM choice of indexers, we reduce the time a QM spends 
waiting for indexers to respond, and hence reduce the time a user waits for search 
results from a distributed digital library. Response time predictions (and a 
measurement of their accuracy) can also be used to inform search timeout choices at 
the QM: judiciously chosen search timeouts could reduce QM wait time and thus, 
user wait time. 

We investigated the accuracy of various algorithms when predicting indexer 
availability and response times from the QM-view, and learned, unsurprisingly, that 
prediction accuracy is related to data consistency. We also learned that the simple 
prediction algorithms we used performed as well as our complex algorithms. The 
primary utility of our work is that it quantifies the accuracy of different prediction 
methods as well as the consistency of observed data from an operational, world- 
distributed digital library. 

With respect to indexer availability from the QM-view, our combined data had an 
87% response rate. Approximately 90% of our indexer availability predictions were 
accurate regardless of the predictive method used - this high accuracy is due to the 
high consistency of the data. One of the simplest prediction methods, the single last 
observation method (in which the previous observation is used as a prediction value), 
performed as well as or slightly better than other predictive methods. 

QM-view predictions of indexer response times were not similarly accurate for all 
predictive methods. Timed low pass filter method B was the most accurate, but the 
simpler running average method performed nearly as well. Timed low pass filter 
methods were not cross-trained, and hence their results might be unduly accurate. 
The accuracy of response time predictions is closely related to the standard deviation 
of the observed response times - again, we see a correlation between data consistency 
and prediction accuracy. 

On average, for all QMs in our study, the running average predictions differed
from the observed response times by 2.9 seconds, and 2.9 seconds is also the standard
deviation of all observed response times in our study. The poorest response time
predictions for a particular indexer as viewed by a particular QM are a more
appropriate measurement for adjusting QM search timeouts. Even for the least
accurately predicted individual indexers, running average predictions differed from
observed response times by 9.9 seconds or less, on average. Ignoring the small
amount of data from the FORTH QM, running average predictions differ by roughly
7 seconds from observed response times for the least accurately predicted
individual indexers.

None of the prediction algorithms we used required on-going heavy computation
from the QM or more than minimal indexer performance monitoring by the QM. Our
future work will involve further explorations of how to improve the resource
discovery process in distributed digital libraries, including the application of
QM-view indexer performance monitoring and predictions to improve QM performance.



Abstract. As current digital libraries become more complex, the facilities
they provide will increase, and so will the difficulty of learning to use those
facilities. In order to produce usable and useful interactive systems, designers
need to ensure that good design features are incorporated into the systems, taking
into consideration end-users’ needs and cultural backgrounds. We carried out a
study to investigate useful design features digital libraries should have. The study
provides insights on the usability impact of digital libraries for task completion
and end-users’ perceived impressions of the effectiveness of the digital libraries.
The results also suggest that there is little provision in the interface to cater
for end-users’ browsing and inter-cultural needs. Hence, this paper also discusses
guidelines for the design of user-centred digital libraries.



1 Introduction 

The growing popularity of the Internet and advancements in networking have brought
about networked hypertext systems such as the Web. In recent years, the Web, the
overwhelming example of a shared world-wide collection of information, has been
extended to include many digital libraries by individuals or groups that select,
organise, and catalogue large numbers of documents.

Although there is as yet no consensus on the definition of digital libraries, they are
generally referred to as "collections of information that are both digitised and
organised" [13], and give us opportunities we never had with traditional libraries or
even with the Web. Digital libraries are emerging and the digital computer is the
technology that has enabled Bush’s "memex" to be finally realised [3]. For
universities and libraries to retain their status and relevance, they have to participate
in the new digital world, as indeed many already do, such as the British Library and
the Library of Congress in the United States.

Although a significant resource of digital libraries has been established with a large
number of potential users, a pressing research challenge still remains in developing
appropriate facilities to promote world-wide access to, and use of, the growing body of
digital information. Several studies have shown that users have great difficulty using
relatively basic Online Public Access Catalogues (OPACs). These difficulties are in
part caused by the conflation of a number of problems:

• difficulty of learning to use any new piece of software;

• difficulty for a non-expert to learn the organisation of information in a library;

• difficulty of learning the particular details of organisation in an unfamiliar library;
and

• difficulty of using Boolean search operators for many users.

As the facilities provided by on-line resources increase, it is likely that the
difficulty of learning associated with complexity in using these facilities will persist
and may continue to increase.

S. Abiteboul, A.-M. Vercoustre (Eds.): ECDL ‘99, LNCS 1696, pp. 167-183, 1999
© Springer-Verlag Berlin Heidelberg 1999



2 Problems with Digital Libraries 

Current digital libraries are becoming more complex systems which include text
search, functionality relating to hypertext, multimedia, the Web and highly interactive
interfaces [20]. If we have problems producing good web sites, as evidenced by the
amount of research done to address problems on the Web [26], then it would not be
unreasonable to anticipate that we will have problems creating good digital libraries!
This is because digital libraries are more than just web sites or stores of information
in digital form. Designers need to provide efficient ways to structure information and
represent it digitally using computers. To design good, usable digital libraries, one
requires knowledge about who will use them, what they will be used for, the work
context and the environment in which they will be used, and what is technically and
logistically feasible. This is all in addition to the usual usability concerns, such as the
tasks and populations of users.

This complexity is further compounded by the fact that designers, content 
providers, and users can have very different cultural backgrounds. Although 
information in digital libraries is supposed to be available globally, its design, content 
provision, and use have remained local. This cultural diversity raises a number of 
questions regarding the cross-cultural usability of digital libraries. 

Designing good, usable interfaces is not an easy task. Dix et al. argue that even if
one has used the best methodology and model in the design of a usable interactive
system, one still needs to assess the design and test the system to ensure that it
behaves as expected and meets end-users’ requirements [8]. Landauer points out that
it is impossible to design an optimal user interface on the first try. If information
access systems are to provide good, usable interfaces, designers must conduct some
form of testing on the interface [11]. However, without knowing where in a system
users run into problems, one has little hope of improving the system [18].



3 Our Study 



This paper presents a study we carried out as part of a project funded by the UK’s
Engineering and Physical Sciences Research Council (EPSRC). The objectives of our
work are to:

• investigate useful design features digital libraries should have by examining three
sample digital libraries;

• study the effects of the lack of these design features on end-users’ performance in
terms of task completion; and

• propose basic design features for the design of digital libraries that will take into
account end-users’ needs.

The pilot work we have done suggests many promising avenues to research in greater
depth.



3.1 Protocol 

Ten computing staff and students were selected to evaluate three sample digital
libraries: the Networked Computer Science Technical Reference Library (NCSTRL),
the New Zealand Digital Library (NZDL) and the ACM Digital Library (ACMDL).
These three digital libraries were chosen because they are available to the general
public and are among the better examples of digital libraries found on the Web in
terms of their information and coverage.

NCSTRL is an international collection of computer science research reports and
papers made available for non-commercial use from more than 100 participating
institutions and archives (see http://www.ncstrl.org/). ACMDL consists of a vast
resource of bibliographic information, citations and full-text articles (see
http://www.acm.org/). NZDL comprises several demonstration collections such as
computer science technical reports, literary works, internet FAQs, and the Computists
Communique magazine (see http://www.nzdl.org/).

Seven of the subjects were researchers with some experience in using digital
libraries. The other three subjects were non-researchers who had no experience with
digital libraries but used the Web often. Since we are interested in investigating
end-users’ performance when using digital libraries, we provided the subjects with two
tasks that involved searching and browsing. We define browsing as "navigating
without any specific goal or purpose". Searching refers to "examining or looking
carefully in order to find information". The subjects could choose how long to spend
on each task. Table 1 shows the two tasks.

Table 1. Information retrieval tasks given to subjects

Task      Description of task

Search    Find a journal article given author’s name, title of article,
          title of journal and year of publication.

Browse    Find all articles by an author between 1996 and 1999,
          given author’s name.



After they had completed the tasks using all three of the digital libraries, they were
asked to complete an extensive questionnaire (described in the following section)
commenting on how satisfied they were with the design and structure of the digital
libraries in helping them to complete the tasks successfully. Subjects who had not
been able to complete a task successfully were asked to explain the reasons.
Satisfaction refers to the "feeling of being pleased with the digital
library in helping to complete the task successfully". Being pleased is defined in terms
of the subjects’ perceived ease of use, rate of errors, and time taken to perform the
task successfully.



3.2 Questionnaire 

The formulation of the questionnaire was greatly inspired by the development of a
measurement tool called the Questionnaire for User Interface Satisfaction (QUIS) by
Chin, Diehl and Norman [5]. QUIS measures end-users’ subjective ratings of the
interface of an interactive system. According to Chin, Diehl and Norman, even though
several questionnaires have been developed to assess end-users’ perceptions of
interactive systems, their weaknesses range from a lack of validation [4] to low
reliability [12]. Chin, Diehl and Norman claimed QUIS is reliable. Our questionnaire
was modelled closely on QUIS, adapted for digital libraries, because of
its reliability as claimed by these authors.

To select the relevant areas to measure usability, we turned to Lindgaard’s
classification of typical usability defects for interactive systems, which includes [14]:
navigation; screen design and layout; terminology; feedback; consistency; modality;
redundancies; end-user control; and match with end-user tasks. Inspired by this
classification, we then formulated the general design categories for evaluating
digital libraries as nine areas, G1 to G9.

G1. Overall reactions to digital library

This area evaluates end-users’ overall perception of the performance of
hypertext in terms of satisfaction, completion of tasks and appeal.

G2. Screen display

This area measures how clearly information is organised and displayed on
the screen.

G3. Terminology and system information

This area examines whether the digital library is consistent in its use of
terminology, wording and format. It also asks if the system provides feedback,
and whether error messages are useful.

G4. Learning

This area investigates the ease of use of the digital library.

G5. System capabilities and user control

This area examines the digital library’s response time, reliability and recovery
process.

G6. Digital library site customisation

This area examines whether the designers have taken into consideration
end-users’ experience and inter-cultural needs.

G7. Navigation

This area asks how clearly navigational elements such as maps and tables of
contents are displayed. It also investigates whether the end-user is "lost",
and the reasons why.

G8. Information retrieval

This area asks about the quality of search facilities, the quality of
search results, and the ease of retrieval/downloading of information.

G9. Completing tasks

This area examines how useful the facilities in the digital library are in
helping end-users to complete their browsing and searching tasks.

For the purpose of our study, 36 out of 40 questions were closed questions, since
they are generally easier to analyse than open questions [19]. Responses obtained
from closed questions can be easily converted into numerical values and a statistical
analysis can be performed. Four open questions were asked since they encourage
"freer" answers from respondents and hence provide a rich source of data which may
otherwise go undetected.

Generally, end-users prefer concrete adjectives for evaluations [7], therefore the
questionnaire used a semantic differential scale, a popular form of attitude scale
widely used in HCI research, to measure end-users’ responses [19]. This scale has
bi-polar adjectives at the end-points, and respondents rate the user interface on a scale
between these paired adjectives by putting a tick in the appropriate column.

          extremely  quite  slightly  neutral  slightly  quite  extremely
poor          1        2       3         4        5        6        7       good



Fig. 1. A 7-point scale to measure end-users’ responses 

For easier analysis and display of results, the semantic differential scale used in the
questionnaire was translated into a 7-point scale (see Fig. 1). For example, number
1 represents "extremely poor" and number 7 represents "extremely good". A value of "5
and above" is considered "good", implying that end-users are generally pleased with
the digital library and designers need not make any changes. A value of "3 and below" is
deemed "unsatisfactory", indicating end-users’ dissatisfaction with the digital library,
and designers should make the necessary changes to correct the deficiency. A mid-value
of 4 is taken to be "neutral"; in that case designers should probably find out more from
end-users and make changes if required.
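The interpretation of the 7-point scale can be written as a small lookup. The function name is an assumption made for this sketch; the thresholds are those stated above.

```python
def interpret_rating(rating):
    """Map a rating on the 7-point scale to the interpretation used in the text."""
    if rating not in range(1, 8):
        raise ValueError("rating must be an integer from 1 to 7")
    if rating >= 5:
        return "good"            # no changes needed
    if rating <= 3:
        return "unsatisfactory"  # designers should correct the deficiency
    return "neutral"             # rating of 4: find out more from end-users


interpretations = [interpret_rating(r) for r in range(1, 8)]
```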



4 Results and Analysis 

We report our results and analysis of subjects’ feedback and performance under the 
following sub-sections: 

• task completion rates and subjects’ perceived overall impressions of the digital 
libraries 

• subjects’ perceived impressions of successful implementations of design categories 
in the digital libraries 




172 Y.L. Theng et al. 

4.1 Overall Impressions and Task Completion Rates 



Table 2 shows the success rates of the ten subjects in completing the two tasks. Table 3
shows subjects’ perceived overall impressions of the digital libraries.



Table 2. Task success rates indicating the percentage of users in the group that managed
to complete the tasks.

Tasks          NCSTRL   ACMDL   NZDL

Search task      80%       0%    50%
Browse task      80%     100%    40%



Table 3. Subjects’ perceived overall impressions of the digital libraries

Overall impressions                            NCSTRL   ACMDL   NZDL

Usability of digital library                    100%     100%    30%
Satisfaction when using the digital library      90%      80%    20%
Appeal of the digital library                    70%      90%    40%
Flexibility of the digital library               80%      90%    30%
Effectiveness in helping with task completion    80%      70%    50%



Questionnaire results are not surprising. They reinforce the indication that end- 
users’ overall impressions of digital libraries are determined by how effective the 
digital libraries are in helping them to complete the tasks successfully. 

Search 

NCSTRL came out well, with 80% of the subjects completing the search task
successfully, compared to 50% for NZDL and 0% for ACMDL. The five subjects who
failed with NZDL were unable to find the article "Designing information-abundant web
sites: issues and recommendations" by Ben Shneiderman [23] because it was not listed
under the appropriate collection but classified as a "technical report". All the subjects
were unable to complete the search task using the ACMDL because the article is not
published in an ACM affiliated publication. The subjects were generally pleased with
the usability of the ACMDL even though they were not successful in completing the
search task.

Browse

All the subjects were able to complete the browse task using the ACMDL. NCSTRL
was also effective in helping the subjects to complete the task (80%). Only 40% of the
subjects completed the task using NZDL, the reason given being that the layout is
confusing. All links to collections seem to lead to search boxes which produced
unhelpful results.



4.2 Design Categories

By analysing subjects’ responses under the nine design categories G1 to G9, we
wanted to find out whether good design features were perceived by subjects to have
been successfully implemented in the three digital libraries. For each question,
subjects’ ratings were grouped under three categories: "3 and below"; "4"; and "5 and
above". Frequencies under the respective areas were obtained. It is debatable, but we
make the assumption that if 75% or more of the ratings for an area fall in the "5 and
above" category, that area is well implemented in the digital library in question.
Table 4 compiles subjects’ ratings of the success of
implementation of user interface design features of the three digital libraries based on
the nine design categories.
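The grouping and the 75% rule can be sketched as follows. The function names and the sample ratings are hypothetical; the sketch also assumes that all ratings given for the questions in one area are pooled before the percentage is taken.

```python
from collections import Counter


def group_ratings(ratings):
    """Count ratings in the three groups: '3 and below', '4', '5 and above'."""
    groups = Counter({"3 and below": 0, "4": 0, "5 and above": 0})
    for rating in ratings:
        if rating <= 3:
            groups["3 and below"] += 1
        elif rating == 4:
            groups["4"] += 1
        else:
            groups["5 and above"] += 1
    return groups


def percent_good(ratings):
    """Percentage of ratings that fall in the '5 and above' group."""
    return 100.0 * group_ratings(ratings)["5 and above"] / len(ratings)


def is_well_implemented(ratings, threshold=75.0):
    """Apply the 75% assumption to the pooled ratings for one design area."""
    return percent_good(ratings) >= threshold


# Hypothetical pooled ratings for one design area, not data from the study.
sample_ratings = [6, 5, 7, 5, 4, 6, 5, 3, 6, 5]
sample_percent = percent_good(sample_ratings)
```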



Table 4. Subjects’ ratings of the success of implementation of user interface design
features of the three digital libraries based on the nine design categories.

Design Categories                           NCSTRL   ACMDL   NZDL

G1: Overall reactions to digital library      84%     86%    40%
G2: Screen design                             62%     86%    47%
G3: Terminology and system information        60%     76%    52%
G4: Learning                                  74%     90%    62%
G5: System capabilities and user control      80%     74%    80%
G6: Digital library customisation             30%     67%    45%
G7: Navigation                                60%     51%    49%
G8: Information retrieval                     85%     79%    70%
G9: Completing tasks: Features to help        67%     84%    66%
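Applying the 75% assumption to the Table 4 percentages gives, for each library, the list of categories that count as well implemented. The data structure and names below are assumptions made for the sketch; the category labelled G7 is the navigation row.

```python
LIBRARIES = ("NCSTRL", "ACMDL", "NZDL")

# Percentage of ratings in the "5 and above" group, per design category (Table 4).
table_4 = {
    "G1": (84, 86, 40),
    "G2": (62, 86, 47),
    "G3": (60, 76, 52),
    "G4": (74, 90, 62),
    "G5": (80, 74, 80),
    "G6": (30, 67, 45),
    "G7": (60, 51, 49),
    "G8": (85, 79, 70),
    "G9": (67, 84, 66),
}

THRESHOLD = 75  # the "well implemented" cut-off assumed in the text

rated_well = {
    library: [category for category, scores in table_4.items()
              if scores[index] >= THRESHOLD]
    for index, library in enumerate(LIBRARIES)
}
```

Note that a strict application of the threshold also includes the overall-reactions category G1 wherever its percentage is 75 or more.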



We will now comment on the usability of each of the three digital libraries: 

Networked Computer Science Technical Reference Library (NCSTRL) 

Of the nine design categories, only systems capabilities and user control (G5: 80%) 
and information retrieval (G8: 85%) design categories were rated well by the subjects. 

Figure 2 shows part of NCSTRL’s search interface. This page is well laid out
and is well designed in terms of the readability of the text and visibility of the status
of the system. It provides help and documentation. The information contained in the
search page is relevant and dialogues do not contain irrelevant information.
Instructions for using the search feature are clearly given, so end-users’ memory load
is minimised. A good search feature allows searching to be performed at both general
and specific levels. Search results returned also provide links to the document and
authors of other works. This is a useful feature providing flexibility and efficiency of
use.

Fig. 2. Screenshot of NCSTRL’s search interface

However, NCSTRL does not speak the users’ language. For example, the "sort by
results" feature has an option "rank", and it is unclear what this option does. The
"clear" button does not clear entries for the "search for ALL bibliographic fields". It
does not support undo and redo functions well. There is no "exit" button to get out of
the search results page.

Figure 3 shows NCSTRL’s browse interface. The design of the browse page
appears cluttered and the instruction to select the kinds of collection to browse is
ambiguous. The scroll window for selecting the collections from participating
institutions is too small, making it inefficient to use.

Fig. 3. Screenshot of NCSTRL’s browse interface

New Zealand Digital Library (NZDL)

Of the nine design categories, only the systems capabilities and user control (G5: 80%)
design category was rated well by the subjects. Figure 4 shows part of the NZDL’s
search interface. The interface design is simple. The search
function is well designed, and the search results returned provide many more textual
formats than NCSTRL’s search results. This gives end-users
different views of the documents and hence flexibility. It also gives the first
three lines of the abstracts of the documents to give users some idea of the contents of
the documents.

Unlike NCSTRL, NZDL does not have a browse interface. This may restrict
flexibility and efficiency of use. The organisation of the collections could be
improved by grouping them instead of providing a list of unrelated options. The
"view figures" icon does not work and no feedback is provided as to why it does not
work. Subjects commented that this made searching and browsing difficult. Hence,
overall NZDL was poorly rated.




Fig. 4. Screenshot of NZDL’s search interface 



ACM Digital Library (ACMDL) 

Of the nine design categories, only the systems capabilities and user control (G5: 74%),
digital library customisation (G6: 67%) and navigation (G7: 51%) design categories
were rated poorly by the subjects. Of the three digital libraries, ACMDL was perceived
by the subjects to be better designed in terms of screen layout, terminology, learning,
information retrieval and search features. Figures 5 and 6 show part of the ACMDL’s
browse and search interfaces respectively.
Fig. 5. Screenshot of ACMDL’s browse interface.




Fig. 6. Screenshot of ACMDL’s search interface.



4.3 Further Discussion 

Following description of our study, this paper now addresses two areas of design 
flaws that seemed evident in all three digital libraries: 

• Navigation in terms of end-users’ confidence in navigating within the digital
library. From our investigation (see Table 4), navigation within the three
digital libraries is still not desirable, ranging from a moderate 49% (NZDL) to
60% (NCSTRL). The subjects indicated that they experienced some degree
of "lostness" ranging from 20% (ACMDL) to 60% (NZDL). We define the
"lost in hyperspace" problem to refer to any of the following phenomena
[26]:

the problem of not knowing where they are in the digital library (ranging 
from 30% to 40%); 

how to get to some other place they know (or think) exists in the digital 
library (ranging from 40% to 60%); 

how to return to a topic left previously (ranging from 50% to 80%); and

the problem of forgetting the key points covered (ranging from 20% to
60%).

"Lostness" experienced by subjects can also have a negative impact on 
subjects’ rates of completion [25]. 

• Digital library customisation concerning end-users’ browsing and cultural
needs. Lack of consideration for end-users’ browsing needs ranges from 30% 
(ACMDL) to 80% (NZDL). The subjects also indicated that the digital 
libraries have not taken cultural needs into consideration, ranging from 30% 
(NZDL) to 90% (NCSTRL and ACMDL). One reason for the neglect of 
cultural aspects may be that usability failure is rather commonplace, and 
cultural usability issues are hard to recognise as such, more so since 
designers cannot help but see the world from their particular cultural point of 
view. Designers also typically invest a lot of effort getting systems to work 
at all, and may be defensive about their work. This usually bolsters another 
cultural barrier, one between professional designers and computer illiterate 
users or what system designers perceive as such. Thus, cultural usability 
issues for system designers may come disguised as illiteracy problems or 
simply as "user faults", rather than as surmountable cultural differences. 
However, from the above we can conclude that the state-of-the-art digital 
library interfaces are not yet prepared to fully meet the culturally specific 
needs of their international users. 



5 Design Lessons 

Our investigations highlight some ways in which digital libraries can be designed to 
make them more usable, more adaptive to end-users’ browsing and searching needs, 
and more culturally sensitive. 
5.1 Provide Better Navigation Support Mechanisms to Address the "Lost in
Hyperspace" Problem

To summarise, our brief analysis of NCSTRL, ACMDL and NZDL highlighted a 
number of points in which digital libraries could be improved. To address the "lost in 
hyperspace" problem in digital libraries, the best strategy is to consistently apply 
basic web document design principles on every single page in the digital libraries 
designers create [26]: 

• Meaningful document header to identify the content of the document. 

• Text-labelled navigation aids to indicate clearly their functions. 

• Page footer to identify the origin, authorship, author contact information, date of 
creation, copyright info, etc. 

• Sensible page length to require as little scrolling as possible.

• Clear use of language to prevent confusion. 

• Simple features (not flashing and fancy ones) to make reading easier. 

• Hypertext links. End-users can move to related information by clicking onto 
hypertext links, represented by underlined text or figure. Hypertextual links 
should be embedded in the documents to provide end-users with the ability to 
move to related information quickly without having to waste time submitting 
another query and waiting for the query results. Links should provide a 
prospective view. Before end-users make the jump, they are given prospective 
information about the destination node (URL with path and filename) provided in 
the footer. It would be helpful to end-users to provide them with the abstract 
and/or outline of the document before they make the jump. 

• Bookmark. End-users can build a set of direct jumps to their favourite places in 
hyperspace using bookmarks. 

• History list. End-users can go back to previously visited web pages since the start 
of the session using a generated history list. 

• Index / table of contents. This provides end-users with an overall view of where 
they are to prevent disorientation. 



5.2 Provide Workspace and Equal Opportunity for More Flexibility and 
Manipulation of Search Results 

If digital libraries are to be user-centred, there is a need to make them adaptive and 
adaptable, taking into consideration end-users’ needs and browsing patterns [2]. 
Cockburn and Jones propose building a graphical browser that dynamically adapts to,
and reinforces, end-users’ browsing actions and mental models [6]. Efficient search 
and linking facilities should be incorporated within digital libraries. One of the
biggest challenges in digital libraries is finding something specific, since there is
so much information available. On-going research to provide more
accurate, faster and more efficient search and linking facilities on the web includes
automated indexing (such as web robots or spiders that walk the entire server tree), text
compression techniques, machine learning techniques, etc. Examples include:

• Meta-search engines (for example, MetaCrawler Parallel Web Search Service;
Savvy Search; ProFusion, etc.) use a multi-threaded query gateway to query
multiple search engines (for example, InfoSeek Search; Lycos; WebCrawler;
Web Worm; JumpStation, etc.) simultaneously [10].

• The New Zealand Digital Library for Computer Science uses modern 
compression techniques to provide access to over 10 000 documents worldwide 
in computer science, and makes them available over the web through full-text 
interfaces [27].

• DEC’s Library Information Access Client supports a card catalogue metaphor and
represents individual searches as objects that can be moved and stored. The 
search results are colour-coded to let end-users know which results go with which 
searches [22]. 

It is important to have effective search engines, but as Agosti, Gradenigo and 
Marchetti [1] argue, it is important to properly represent the results to users. However, 
Harman [9] argues that we need more than just "user friendly" front ends. The whole 
system must be designed for usability. 

The problem with the search facilities provided by the NCSTRL, ACMDL and
NZDL sites was the lack of a facility to manipulate the search results independently
of the search mechanism itself. This had practical repercussions. For example, in the
NCSTRL library one subject reported being able to discover result sets of
different sizes (54 and 67) on the same search item, but being prevented from sorting
the larger set, as it was not created directly from the search mechanism, but instead by
clicking on the author name in a result set (searching on the same data yielded yet
another result set of 175 items, but that was clearly too general). Also, in all the libraries the
only way to return to a search result set was to re-execute the search, or use the
browser "back" button to return to it. Rejected items are still included, and no
indication of value is given to selected items.

It may be useful to compare this to the opportunity of activity in a traditional 
library. Here, readers are able to get a list of possible items of interest, and retrieve 
them for further inspection. Those which prove of cursory interest can be set aside or 
returned to their usual place quickly, and those of greater use can be gathered together 
for deeper investigation. 

A comparison can easily be drawn between these work patterns and the principle 
of Equal Opportunity introduced as a heuristic for human-computer interaction [21]. 
Here, the user can exploit the prior output of the computer as input to a further stage 
in interaction, with or without modification. 

If we thus introduce to the digital library facilities a "desk" for interaction on the 
basis of Equal Opportunity, the reader of the library gains the opportunity both to 
mirror real world behaviour, and to interact effectively with additional digital 
facilities in the same domain. For instance, discovered items can be collected, 
ordered, prioritised, remembered, etc. within the digital library space. More
concretely, in the example above, the larger output set could be selected, and then 
reordered using existing facilities, making the provision of effective support for the 
reader more complete. 
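
The "desk" idea can be illustrated with a minimal sketch. The names used here (`Desk`, `Item`, `collect`, `reject`, `reorder`) are hypothetical and are not part of any of the three libraries studied; the sketch only assumes that a result set, however it was produced, can be treated as ordinary data and then collected, set aside and reordered independently of the search mechanism.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Item:
    """One retrieved document (hypothetical fields)."""
    title: str
    author: str
    year: int


@dataclass
class Desk:
    """A workspace holding items independently of how they were found."""
    items: list = field(default_factory=list)
    rejected: set = field(default_factory=set)

    def collect(self, results):
        # Any prior output (search, author link, browse) can be used as input.
        for item in results:
            if item not in self.items:
                self.items.append(item)

    def reject(self, item):
        # Set an item aside so that it is no longer offered to the reader.
        self.rejected.add(item)

    def reorder(self, key):
        # Sorting acts on the desk contents, not on the search mechanism.
        kept = [i for i in self.items if i not in self.rejected]
        return sorted(kept, key=key)


from_search = [Item("B report", "Lee", 1997), Item("A report", "Lee", 1995)]
from_author_link = [Item("C report", "Lee", 1998), Item("A report", "Lee", 1995)]

desk = Desk()
desk.collect(from_search)
desk.collect(from_author_link)
desk.reject(from_search[0])
by_year = desk.reorder(key=lambda i: i.year)
```

In this sketch the set obtained through the author link can be sorted exactly like the set obtained through the search form, which is the behaviour the subject in the NCSTRL example was missing.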
5.3 Provide Culturally Sensitive User Interfaces

To provide multi-cultural interfaces to digital libraries, we envisage the development 
of boundary objects between different cultures accessing shared information 
resources. Boundary objects organise shared but simultaneously distributed cognition. 
Boundary objects are used by different communities without presupposing a fully 
shared definition of an object. They are flexible enough, such that each community 
can read a specific meaning from a boundary object sufficient to its needs. 
Simultaneously, they are "robust enough to maintain a common identity across sites" 
[24]. As such, they enable collaboration and communication across cultural 
boundaries on equal terms, for example, without recourse to a single-sided dominant 
mode of symbolisation. 

Boundary objects function between human cultures in much the same way that 
module interfaces separate implementation concerns in programming, but 
nevertheless allow modules to communicate without accidental assumptions causing 
trouble. To achieve the emergence of inter-cultural boundary objects in digital 
libraries, co-operative and communicative features need to be introduced that allow 
negotiation and articulation across sites. 

We offer some ideas for implementation of boundary objects in three areas: 

• Creation of boundary objects as part of the digital library interface. 
Actually, a digital library system with perfectly localised interfaces could 
function as a joint composite boundary object. However, small boundary 
objects and shared resources could start off a process of mutual cultural 
education between users, designers and content providers. The introduction 
of asynchronous message systems, repositories and frequently asked 
questions (FAQs) could serve such a function because it allows users, 
designers and content providers to quickly exchange information. Another 
idea is to build graphical browsers that rely on dynamically generated 
structure maps that adapt to end-users’ needs and come in various forms [16]: 
global maps show the entire hyperspace; local maps show the "vicinity" of 
the current node in terms of hyperlinks to and from other related nodes; and 
fisheye views focus attention on important nodes by deliberately distorting 
the view. 

• Creation of a learning environment. The emergence of boundary objects 
depends on mutual education of the participants. Therefore, in digital library 
interfaces, a learning environment is necessary. In order to create a learning 
environment, we need to provide additional facilities that help end-users, 
content providers and designers in fulfilling their tasks or even to provide 
intelligent intermediaries to do the tasks for them. In conventional libraries, 
the provision of this kind of support is the helpdesk manned by a librarian. 
While helping the users to find information by doing things for them, the 
librarian is also often surreptitiously teaching the users how to make the best 
use of the library. As a result, users are able to do at least part of the task on 
their own. Simultaneously, the librarian learns about the interest of the users. 
Often the support from librarians is augmented by the provision of support 
from user to user. More experienced users can offer informal help and advice
to novice users. In creating such a learning environment for end-users, we 
should provide suitable support features when collaboration between users is 
most effective. The construction of Community Memory Support Systems 
like Answer Garden and FAQ lists will allow end-users to gain an 
understanding of how systems can be used. 

• Creation of opportunities to create boundary objects by users. Even the best 
designer cannot foresee all cultural problems and possibilities. The idea, 
therefore, is to create opportunities for end-users to create boundary objects. 
Giving end-users the opportunity to articulate and exchange their ideas and 
problems with regard to a particular digital library may also provide 
surprising ideas that could be taken up by designers. Awareness mechanisms 
have to be developed that will allow end-users to be aware of when others 
are accessing the same resource. The use of synchronous co-operative 
support tools like Chat Rooms and Meeting Rooms will allow end-users to 
discuss and debate different approaches to accessing the on-line resources. 
The core use of these tools is to support the co-operation and debate needed 
to resolve decisions. To help end-users tackle the problem of information 
overload as well as not to be "lost" in the wealth of information available, we 
suggest the use of interface agents in digital libraries to make them more 
adaptive to end-users’ needs. Interface agents make software more active and 
work autonomously without waiting for end-users’ command. One example 
of the use of software agents in digital libraries is the investigation of 
personalised information filtering systems to help end-users to eliminate
irrelevant information and bring relevant information to end-users’ attention 
[15]. 
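
As an illustration of the structure maps mentioned in the first item above, a local map can be sketched as a bounded breadth-first traversal of the hyperlink graph. The function name `local_map`, the `radius` parameter and the example page names are assumptions made for this sketch; links are followed in both directions because the text describes hyperlinks "to and from" related nodes.

```python
from collections import deque


def local_map(links, current, radius=1):
    """Return the nodes within `radius` hyperlinks of `current`."""
    neighbours = {}
    for src, dst in links:
        # Treat links as undirected: the map shows links to and from a node.
        neighbours.setdefault(src, set()).add(dst)
        neighbours.setdefault(dst, set()).add(src)
    distance = {current: 0}
    queue = deque([current])
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue  # do not expand beyond the requested vicinity
        for nxt in neighbours.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return set(distance)


links = [("home", "search"), ("search", "results"),
         ("results", "paper"), ("home", "browse")]
vicinity = local_map(links, "search", radius=1)
```

A global map corresponds to an unbounded radius; a fisheye view would additionally weight nodes by importance, which this sketch does not attempt.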



6 Conclusions and On-Going Work 

In this paper, we carried out a study to investigate useful design features digital 
libraries should have. The study provided insights on the usability impact of digital 
libraries for task completion and end-users’ perceived impressions on the 
effectiveness of the digital libraries. We discussed design guidelines for the design of 
user-centred digital libraries. This is on-going research for us. In order to achieve our 
goal to define a set of principles for the design of digital libraries, the design features 
discussed in this paper need to be further refined, tested and used in real-world 
situations before they can emerge as principles for design of user-centred digital 
libraries. 
Abstract. The ultimate goal of an information provider is to satisfy
the user information needs. That is, to provide the user with the right 
information, at the right time, through the right means. A prerequisite 
for developing personalised services is to rely on user profiles representing 
users’ information needs. In this paper we will first address the issue of 
presenting a general user profile model. Then, the general user profile 
model will be customised for digital library users.



1 Introduction 

It is widely recognised that the internet is growing rapidly in terms of the number 
of users accessing it, the amount of information created and accessible through it 
and the number of times users use it in order to satisfy their information needs. 
This has made it increasingly difficult for individuals to control and effectively
search for information among the potentially infinite number of information sources
available on the internet. Ironically, just as more and more users are getting
on-line, it is getting increasingly difficult to find relevant information in a reasonable
amount of time, unless one knows exactly what to get, from where to get it and 
how to get it. New emerging services are urgently needed on the internet to 
prevent computer users from being drowned by the flood of available information. 

Typical information sources on the internet, like search engines, digital
libraries and online databases (e.g., [1,6,12,13], just to mention some), provide a
search and retrieval service to the web community at large. A common
characteristic of most of these retrieval services is that they provide little or no
personalised support to individual users. Indeed, they are
oriented towards a generic user. In fact, they answer queries crudely rather than,
for instance, learning the long-term requirements idiosyncratic to a specific user.
Moreover, they seldom select and organise information for users accordingly, e.g.,
assisting in the selection of books or other archived documents from libraries,
news items from press agencies, television stations and journals, or documents
from administrative bodies. Providing personalised information search and
delivery services, as additional services to the uniform and generic information
search offered today, is likely to be the first step to make relevant information
available to people in the appropriate form, amount and level of detail, at the
right time through the right means, and with minimal user effort.

A prerequisite for developing systems providing personalised services is to rely
on user profiles, i.e. a representation of the preferences of any individual user.
Roughly, a user profile is a structured representation of the user’s needs, on the
basis of which a retrieval system should, e.g., act autonomously to pursue
the goals posed by the user (irrespective of
whether the user is connected to the system).

It is quite obvious that a user profile modeling process requires two steps
(which constitute the user profile modeling methodology). We have to describe

— what has to be represented, that is, which information pertaining to the user
has to be represented, and

— how this information is effectively represented.

The topic of this paper is to describe both steps. We will show that the first one 
can be described in a quite general and application independent way, while the 
second one depends on a particular application. In order to be concrete, we will 
propose a user profile model which can be used in the context of the NCSTRL 
digital library (Networked Computer Science Technical Reference Library) [13]. 
Essentially, using profiles, users will be able to create their own customised 
scientific interest representation. This allows the digital library to provide a 
“notification service”, by e.g. e-mailing the users when documents (like technical 
reports and articles) matching their scientific interest become available in the 
digital library.^ The interesting point is that simple modifications to the existing 
architectures are sufficient in order to provide this service. 
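
To make the notification service concrete, the following is a minimal sketch of how newly arrived documents might be matched against stored profiles. The matching rule (simple keyword overlap), the function names `matches` and `notifications`, and the example addresses and report identifiers are all assumptions made for illustration; they are not the mechanism of NCSTRL or of the architecture proposed in this paper.

```python
def matches(profile_terms, document_terms, threshold=1):
    """Return True if enough profile terms occur in the document."""
    wanted = {t.lower() for t in profile_terms}
    present = {t.lower() for t in document_terms}
    return len(wanted & present) >= threshold


def notifications(profiles, new_documents):
    """Map each user's e-mail address to the new documents that match."""
    outbox = {}
    for email, terms in profiles.items():
        hits = [doc_id for doc_id, doc_terms in new_documents.items()
                if matches(terms, doc_terms)]
        if hits:
            outbox[email] = hits
    return outbox


# Hypothetical profiles and newly available technical reports.
profiles = {
    "ann@example.org": ["digital", "libraries"],
    "bob@example.org": ["compilers"],
}
new_documents = {
    "TR-1": ["Usability", "of", "digital", "libraries"],
    "TR-2": ["Network", "protocols"],
}
outbox = notifications(profiles, new_documents)
```

Only users with at least one matching document receive a message; actually sending the e-mail is left out of the sketch.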

We proceed as follows. In the next section we will introduce those concepts 
which have to be taken into account in a quite general user profile modeling 
process. In Section 3 we will apply these concepts to a special case: digital 
libraries (like NCSTRL). In Section 4 we will present two solutions for extending 
an existing search service in a retrieval system in order to take into account user 
profiles. Section 5 concludes and describes further work. 

2 User Profile Modeling 

The topic of this section is to describe some general concepts involved in the
user modeling process. In particular, we will describe what has to be represented in
a user profile from the user’s point of view.

2.1 The General Information Retrieval Scenario 

The general concern of a user is the retrieval of relevant information that
pertains to his/her information needs. So, let us first introduce a global and general
information seek scenario (see Fig. 1).

^ Similar features are promised within the ACM Digital Library [6] as a forthcoming 
service. 
[Figure 1 content. Main actors: user information needs, information sources.
Main tasks: gather relevant information, deliver information.
The goal: satisfying the user’s information needs.]

Fig. 1. The information seek scenario



We can distinguish two main actors in it: the user information needs and the 
information sources. 

User Information Needs. With user information needs we mean “what” a user 
is really looking for. Examples of user information needs may be 

1. “I’m looking for journal articles about computer networking, published not 
later than 1996. I want to pay less than 2$ for each.” 

2. “I’m looking for news concerning the latest trend about stock quotes of 
High-Tech companies.” 

3. “I’m looking for MPEG videos about Formula One races, downloadable from 
the web in less than 2 hours.” 

4. “I’m looking for hike tours in the Alps.” 

In the following, we will consider the information needs described by Point 1. — 4. 

The first observation is that a user information need may be quite different 
w.r.t. information type and content. With respect to the type, in cases 1. — 4. 
we are looking for journal articles, news, MPEG videos and images, respectively. 
These describe the type of information we are looking for. On the other hand, 
w.r.t. the content, in cases 1. — 4. we are also looking for information which is 
about computer networking, about stock quotes of High-Tech companies, about 
Formula One races and geographical maps with hike trails. These describe what 
information we are looking for from an information content point of view. 

The second observation is that not only do different users have heterogeneous
information needs, but there may also be heterogeneity among the needs of a
single user. That is, the information needs 1. — 4. may belong to four different
users or may be four different needs (from a type and content point of view) of
the same user.

A third and final observation is that a user information need may be a short 
term user information need or a long term user information need. In the former 
case we refer to an ad-hoc, occasional user information need, whereas in the 
latter case we refer to a user information need which is of interest during a 
relevant time period. It is easily verified that in fact our daily information seek 
process involves both temporary needs as well as long term interests. Of course, 
whether an information need is a short term or a long term interest depends 
on the user. For instance, if an economist (say, John) is planning to hike in the
mountains next weekend (an event that seldom happens for John), then he is
looking for some site map and tour and may express his need through Point 4.
above. This is a short term information need of John. But John, as a serious
economist, is also interested in any kind of news related to stock exchange quotes.
He may express his information need through Point 2. above. Of course, this is
a long term information need. On the other hand, Point 2. may be a short term
interest of a computer scientist (say, Tom), whereas Point 1. may be his long
term interest.

In summary, information needs may differ w.r.t. their type, their content
and their duration (short term and long term). Moreover, information needs are
heterogeneous both among users and within a single user. All these aspects have to be
taken into account during the user profile modeling process.
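
The three dimensions just summarised can be written down as a small data structure. This is only a sketch: the class names `InformationNeed` and `Duration` and the field names are chosen here for illustration and are not part of the model defined later in the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Duration(Enum):
    SHORT_TERM = "short term"
    LONG_TERM = "long term"


@dataclass(frozen=True)
class InformationNeed:
    doc_type: str       # the type of information, e.g. news or images
    content: str        # what the information is about
    duration: Duration  # ad-hoc need or lasting interest


# John the economist from the example above: one user, heterogeneous needs.
john = [
    InformationNeed("news", "stock quotes of High-Tech companies",
                    Duration.LONG_TERM),
    InformationNeed("images", "hike tours in the Alps",
                    Duration.SHORT_TERM),
]
long_term = [n.content for n in john if n.duration is Duration.LONG_TERM]
```

Keeping the duration explicit lets a system treat John's hiking request as a one-off query while retaining his interest in stock quotes in the profile.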



Information Sources. With information sources we mean all the heterogeneous 
digital information providers distributed over the Internet, which make available 
any kind of information which might be of interest to Internet users. Examples 
of information sources are web sites, online databases, news groups, news agen- 
cies, search engines, digital libraries, etc. Essentially, they differ in what kind 
of information they provide, what services they provide and which users they 
address. 

The ultimate goal of an information provider is to satisfy user information
needs, that is to provide the user with the right information, at the right time,
through the right means. It is easily verified that this requires the execution of
two separate tasks: to gather relevant information and to deliver it. The first
task, typically the harder one, is that of gathering the information which is
thought to be of interest to the user. Once the information has been collected,
it has to be delivered to the user, according to his preferences (second task).
Examples of delivery modalities may be web pages (this is the usual case
with which most of us are familiar), e-mail, phone, fax (e.g., a user wants to
receive stock quotes by phone, e-mail or fax), or surface mail (e.g., a user wants
to receive the proceedings of a conference by surface mail).

As far as our work is concerned, we will concentrate on user information needs. In
particular, in the next section we will refine the concepts involved in the user
information need modeling process. An equally important topic is the modeling
of information sources, which we will not address in this paper.



2.2 The Data Categories of a User Profile 

In this section we will present a general user profile model through which we
may represent the user’s preferences and needs.

By relying on the discussion of the previous section (see also Fig. 1) it is 
quite clear that in a user profile we have to represent at least 

— what has to be gathered, and 

— how the gathered information has to be delivered to the user. 

We will show in the following that the information to be represented about the 
users is not only restricted to the two categories above, but may be classified in 
fact into (at least) five data categories. These categories are the personal data 
category, the gathering data category, the delivering data category, the actions 
data category, and the security data category. In the following we will describe 
these five categories in detail. 



The Personal Data Category. The personal data category is a collection of 
user’s personal identification data. Under this category we consider data like 
user’s name, birth date, gender, identity certificate, employer, home contact 
information, business contact information, etc. (see e.g., [14] as a concrete case). 



The Gathering Data Category. The gathering data category collects preferences 
and restrictions about the documents a user is looking for. These preferences and 
restrictions may be classified into three distinct subcategories, each addressing 
orthogonal document dimensions. These subcategories are: 

— the document content category: specification of what has to be gathered.
Under this category we consider preferences on document properties that
relate to the content a user is looking for: the document language and its
aboutness. For instance, “I’m looking for documents talking about computer
networking, written in English.”

— the document structure category: the user’s specification of all those properties of
a document he/she is looking for which relate to the structure of a document,
like its format (text formats, image formats, audio formats, video formats),
its type (article, technical report, proceedings, news, novel, poem, www home
page), its creation date, its cost and dimension. For instance, “I’m looking
for GIF images created today.”

— the document source category: specification of where to gather from. In this
category we collect all the user’s restrictions on the source from which he/she
would like to receive information, like a restriction on the URL (e.g., “I want
only documents from http://www.w3.org/”), the specification of publishers
(e.g., “I want only news from Reuters”), series (e.g., “I want only articles
from the Lecture Notes in Computer Science”), authors (e.g., “I’m looking
for audio records of Giuseppe Verdi.”).

In summary, in a user profile we should allow the representation of what to 
gather (in terms of the structure and the content of a document) and where to 
gather from. 
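
The three subcategories can be read as independent filters over candidate documents. The sketch below is one possible encoding, under the assumption that every preference is optional and that a document is accepted only if it satisfies all preferences that are set; the class and field names are hypothetical and cover only a few of the properties listed above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Document:
    about: frozenset  # topics the document is about
    language: str
    doc_type: str
    created: date
    cost: float
    publisher: str


@dataclass
class GatheringPreferences:
    # Document content category
    about: Optional[str] = None
    language: Optional[str] = None
    # Document structure category
    doc_type: Optional[str] = None
    created_after: Optional[date] = None
    max_cost: Optional[float] = None
    # Document source category
    publisher: Optional[str] = None

    def accepts(self, doc: Document) -> bool:
        # An unset preference (None) places no restriction on the document.
        checks = [
            self.about is None or self.about in doc.about,
            self.language is None or doc.language == self.language,
            self.doc_type is None or doc.doc_type == self.doc_type,
            self.created_after is None or doc.created > self.created_after,
            self.max_cost is None or doc.cost <= self.max_cost,
            self.publisher is None or doc.publisher == self.publisher,
        ]
        return all(checks)


prefs = GatheringPreferences(about="computer networking", language="English",
                             doc_type="article", max_cost=2.0)
cheap = Document(frozenset({"computer networking"}), "English", "article",
                 date(1996, 5, 1), 1.5, "Example Press")
costly = Document(frozenset({"computer networking"}), "English", "article",
                  date(1996, 5, 1), 3.0, "Example Press")
accepted = [prefs.accepts(cheap), prefs.accepts(costly)]
```

Because the dimensions are orthogonal, each check is independent of the others, so a new restriction can be added without touching the existing ones.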



The Delivering Data Category. Under the delivering data category the user
specifies preferences on the delivery modes of the gathered information. These
preferences may be classified into two distinct subcategories, each of them
addressing orthogonal delivering dimensions. These subcategories are:

1. the delivery means category: specification of how to deliver. In this category
we consider user preferences regarding the delivery means, like phone, fax,
web and e-mail, that should be used in order to deliver the information the
user requested.

2. the delivery time category: specification of when to deliver. In this category
we consider user preferences regarding the delivery time, like interval (e.g.,
“deliver me the news I’m interested in each morning at 9 am, except during
the weekend.”) and as soon as possible (e.g., “deliver me the news I’m
interested in as soon as you gather it.”, or “deliver me the stock quote exchange
rate I’m interested in as soon as it loses more than 5%.”).

In short, in order to represent the user’s delivery preferences we should
represent how to deliver and when to deliver.
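
The two delivery-time examples above can be expressed as simple predicates. This is a sketch only: the function names `due_for_delivery` and `loss_trigger` and their parameters are assumptions introduced here, and a real service would also have to handle time zones and missed delivery slots.

```python
from datetime import datetime


def due_for_delivery(now, hour=9, skip_weekend=True):
    """Interval rule: deliver at the given hour, optionally not on weekends."""
    if skip_weekend and now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    return now.hour == hour


def loss_trigger(previous, current, threshold=0.05):
    """As-soon-as-possible rule: fire when a quote loses more than 5%."""
    return (previous - current) / previous > threshold


weekday_morning = due_for_delivery(datetime(1999, 9, 22, 9, 0))   # a Wednesday
weekend_morning = due_for_delivery(datetime(1999, 9, 25, 9, 0))   # a Saturday
big_drop = loss_trigger(100.0, 94.0)
small_drop = loss_trigger(100.0, 96.0)
```

The delivery means (phone, fax, web, e-mail) is orthogonal to these rules and would simply select which channel is used once a rule fires.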



Actions Data Category. A personalised service should be highly responsive to 
the needs of the user. In particular, long term information needs involve repeated 
interactions with the user. Assuming that a lot of the user actions are consistent, 
a retrieval service should match increasingly better his/her needs over time. Fur- 
thermore, since the interaction could extend over a long period of time, it cannot 
be assumed that the users interests will remain constant. The change in interest 
could be anything from a slight shift in relative priorities to completely losing 
interest in some domain and gaining interest in another. In general, a system 
must be able to detect or must allow the user to indicate the change in interests 
and should respond by adapting to these changes. The system must be able to 
explore newer domains and prospect for interesting information. To summarise, 
a personalised service should be capable not only of dealing with the currently
known needs of the user, but also exploring different domains to find documents 
of potential interest to the user. Thus, it should be specialised, adaptive and 
exploratory. 

In order to provide a service with the above capabilities, under the actions
data category, we collect a set of actions, not necessarily taken only by the user
himself/herself. The actions data generally contains the recording of the user’s
interaction with retrieval systems and navigation data. Typical actions data may
be URLs of visited web pages, read documents and the user’s relevance judgements.
By relying on techniques based on user relevance feedback [2,4,8,9,11,15,16,17],
as well as on collaborative feedback (in this case the user is usually a member of
an interest group) [5,10], the actions may profitably be used for refining the
user’s gathering data specification.



Security Data Category. Users want to express their privacy practices. The security
data category is a collection of user preferences establishing the conditions
under which the data represented in the user profile may be accessed. These
preferences may regard all the previous categories (the personal data, the gathering
data, the delivering data and the actions data categories). Typically, users may
establish different privacy practices for each of the services they access.
Extensive work on privacy preferences can be found in [14].



An Example. We end this part with an example, illustrating the concepts intro- 
duced in the sections above. Suppose a user’s profile is as follows: 

1. “I’m John Smith, 34 years old, and I’m looking for

2. video sequences, dated after April 1st, for which I don’t want to pay,

3. which are about Michael Schumacher driving his Ferrari and

4. published by FIA (Federation Internationale de l’Automobile).

5. Deliver me as soon as possible

6. an audio summary message of the top ranked video I’m interested in and an
SMS message containing the source URL, to my cellular phone, +39.0347.593404.

7. I have already seen http://www.fia.com/news/newsl.mov and consider
http://www.ukmotorsport.com/news/ferrari.html as relevant to what
I’m looking for.

8. I do not allow access to my personal data.”

According to the user profile schema summarised in Table 1, Point 1 pertains to the
personal data category, Point 2 to the document structure category,
Point 3 to the document content category, Point 4 to the
document source category, Point 5 to the delivery time category, Point
6 to the delivery means category, Point 7 to the actions data
category and Point 8 to the security data category.

3 A Profile Schema for Digital Library Users 

Just as there exist several ways to represent documents (like the vector
space model, Dublin Core, MARC, etc.), there may be different,
application-dependent user profile representations. In this section a profile schema
tailored for digital library users is discussed. The proposed profile schema, while
remaining within the general user profile model presented in the previous
sections, tries to capture typical aspects that may be required by a digital library
user. These features, if well exploited, can significantly help an advanced digital
library to automatically search for documents relevant to the user. For instance,
a particularly interesting case concerns digital library users with long-term
interests, such as scientists. In this case, a digital library (like NCSTRL and ACM
[6,13]) may notify the user as soon as a new article, technical report or the like
that matches his/her research interests has been made available.

Table 1. User profile schema summary

User Profile:
    personal data category
    gathering data category:
        document content category
        document structure category
        document source category
    delivering data category:
        delivery means category
        delivery time category
    actions data category
    security data category

We will first describe the general structure of the user profile schema, then 
particular attention will be paid to the gathering data category. 

3.1 The Profile Schema 

Users that want to exploit the retrieval capabilities of a digital library are
supposed to subscribe to the service. As a consequence of the subscription, a
personalized profile is created for the user. The profile is identified by a unique profile
identifier. This can be formalized as follows:

$$\mathit{Profiles} = \mathit{ProfID} \rightarrow \mathit{UserProfile} \tag{1}$$

As we have seen, user profile data may be classified into five categories. We 
formalise this with 



$$\mathit{UserProfile} = (\mathit{PersData} \times \mathit{GathData} \times \mathit{DeliData} \times \mathit{ActData} \times \mathit{SecData}) \tag{2}$$

In the following we will formally describe each of these categories. 

Personal Data. The personal data category contains information about the user
identity. To comply with a standard, we propose to rely on the P3P “user” schema
[14] for the PersData specification.

Gathering Data. This category of the user profile specifies what documents a
user is interested in. In Section 2.1 we have seen that a certain user may have 
at the same time several different interests. In the user’s profile all the user’s 
interests should be described separately so that different types of preferences 
can be specified for different interests. We call topic a single user information 
need. In order to capture the fact that the user may have several interests, the 
proposed user profile is associated with a set of topics. This is formalized as 
follows: 



$$\mathit{GathData} = \mathit{TopicID} \rightarrow \mathit{Topic} \tag{3}$$

Each topic is identified by a topic identifier that should be unique for a given 
user profile, i.e. the pair (ProfID, TopicID) is unique. The complete definition
of a topic will be given separately in Section 3.2. 



Delivery Data. Different users may have different delivery modalities. In order 
to take into account the delivery means and the delivery time, we formalise 
DeliData as 



$$\mathit{DeliData} = (\mathit{DelMode} \times \mathit{TimeMode}) \tag{4}$$

The delivery mode contains the specification of which means should be used
to deliver information, what the delivered information should look like and the
destination address. More formally:

$$\mathit{DelMode} = (\mathit{DelMeans} \times \mathit{Layout} \times \mathit{Destination}) \tag{5}$$

The user can choose to be notified using one of the delivery means available, like
e-mail, web page, phone, fax, etc. Since the potential users of a digital library
like NCSTRL are scientists, e-mail or web page is adequate. In the former case an
e-mail, formatted according to the layout preferences, is sent to the address
specified in the destination field. In the latter case, retrieved documents are published
on the web page identified by the destination field and formatted according to
the layout preferences. The layout specification may contain preferences about
e.g. the colors and fonts to be used, and preferences about the information to
be included for each relevant document found (e.g. title, abstract, authors,
keywords, etc.). The destination specification is an address identifier that depends
on the delivery means.

The time mode specifies when to deliver. We will basically consider a
condition like “new document found” and “updated document” associated with a
delivery time. The delivery time can be a fixed time interval (e.g. every day at 9
am) or “as soon as possible”. Formally we have:

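$$\begin{aligned}
\mathit{TimeMode} &= (\mathit{NewDoc} \times \mathit{UpdatedDoc} \times \mathit{Time}) \\
\mathit{NewDoc} &= (\mathit{yes} + \mathit{no}) \\
\mathit{UpdatedDoc} &= (\mathit{yes} + \mathit{no}) \\
\mathit{Time} &= (\mathit{TimeInterval} + \mathit{asap})
\end{aligned} \tag{6}$$

where TimeInterval is defined according to the syntax of the Unix crontab file.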

It is worth noting that the formalisation of DeliData establishes that a user
has a single delivery modality. We may enhance the schema by allowing a
delivery modality for each user topic, i.e. for each user interest. This is, for instance,
required to model cases like “upload the proceedings of ECDL to my ftp server,
and send me the abstracts of papers about filtering systems by e-mail”. In
order to take this into account we may formalise DeliData as follows:

$$\mathit{DeliData} = \mathit{TopicID} \rightarrow (\mathit{DelMode} \times \mathit{TimeMode}) \tag{7}$$

Actions Data. The actions data category is a sequence of pairs, each representing
an action performed on a certain document. As actions are typically used for
coding the user’s relevance feedback within one of his/her topics of interest, we formalise
ActData as follows:

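$$\begin{aligned}
\mathit{ActData} &= \mathit{TopicID} \rightarrow (\mathit{Action} \times \mathit{DocumentID})^{*} \\
\mathit{Action} &= (\mathit{read} + \mathit{relevant} + \mathit{notrelevant})
\end{aligned} \tag{8}$$

where DocumentID is the identifier of a document notified to the user (a URI).
The action identifiers will be used according to the following meanings: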

read: the user looked at the full text of the document 

relevant: the user judged the document as relevant 

notrelevant: the user judged the document as not relevant

Of course, other actions may be included as well. It’s beyond the scope of this 
paper to further detail how the actions data may be used for relevance feedback 
analysis. 



Security Data. As users subscribing to a service agree on its privacy practices (see, 
e.g. [14]), the simple privacy maintenance mechanism we adopt is to specify in the 
security data category which on-line services may access the user’s information. 
It is basically a list of the hosts that are authorized to ask for information 
contained in the user’s profile. 

$$\mathit{SecData} = (\mathit{HostName})^{*} \tag{9}$$
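
As an illustration only, equations (1) to (9) can be transcribed into a small data model. The sketch below is not part of the proposed system: the class and field names, the example host and e-mail address, and the choice of the per-topic variant (7) of DeliData are assumptions made for the example, and Topic is left opaque because it is defined in Section 3.2.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Tuple


class Action(Enum):                 # Action in equation (8)
    READ = "read"
    RELEVANT = "relevant"
    NOT_RELEVANT = "notrelevant"


@dataclass
class DelMode:                      # equation (5)
    del_means: str                  # e.g. "e-mail" or "web"
    layout: Dict[str, Any]          # layout preferences
    destination: str                # address, depends on del_means


@dataclass
class TimeMode:                     # equation (6)
    new_doc: bool
    updated_doc: bool
    time: str                       # crontab-style entry, or "asap"


@dataclass
class UserProfile:                  # equation (2)
    pers_data: Dict[str, str]
    # (3): TopicID -> Topic, with Topic kept opaque in this sketch
    gath_data: Dict[str, Any] = field(default_factory=dict)
    # (7): TopicID -> (DelMode, TimeMode)
    deli_data: Dict[str, Tuple[DelMode, TimeMode]] = field(default_factory=dict)
    # (8): TopicID -> sequence of (Action, DocumentID)
    act_data: Dict[str, List[Tuple[Action, str]]] = field(default_factory=dict)
    # (9): hosts authorized to read the profile
    sec_data: List[str] = field(default_factory=list)


# (1): ProfID -> UserProfile
profiles: Dict[str, UserProfile] = {}

# Hypothetical subscriber, used only to show how the pieces fit together.
profile = UserProfile(pers_data={"name": "John Smith"})
profile.deli_data["t1"] = (
    DelMode("e-mail", {"fields": ["title", "abstract"]}, "john@example.org"),
    TimeMode(new_doc=True, updated_doc=False, time="0 9 * * 1-5"),
)
profile.act_data["t1"] = [(Action.RELEVANT, "http://www.ukmotorsport.com/news/ferrari.html")]
profile.sec_data.append("dl.example.org")
profiles["p1"] = profile
```

Keying every per-topic structure by the same TopicID keeps the pair (ProfID, TopicID) as the single handle for an interest, as required in the text.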



3.2 Topics 

As specified in Section 3.1, a user may have several topics of interest and a topic 
specifies what to gather. According to Section 2.2, we define Topic as

$$\mathit{Topic} = (\mathit{TopicName} \times \mathit{DocContent} \times \mathit{DocStruct} \times \mathit{DocSource}) \tag{10}$$

The document content category contains the information that allows the system
to recognise documents relevant to a topic from a document content point of
view. In digital libraries, the content of a document is described by means of
its title, its textual description (the abstract or summary), a list of relevant
keywords, a list of standard categories (e.g. the ACM categories) and its
language. We formalise this as



$$\mathit{DocContent} = (\mathit{Title} \times \mathit{TextualDescr} \times \mathit{Keywords} \times \mathit{Categories} \times \mathit{Languages}) \tag{11}$$



The document structure category contains information that allows the system
to eliminate candidate documents according to their structural properties.
Documents in digital libraries can be stored in different file formats and they can
be of different types (e.g. book, technical report, scientific article).
Moreover, documents have a publication date and, in case of a paying service,
they can have a price. The DocStruct category is defined as follows:



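$$\begin{aligned}
\mathit{DocStruct} &= (\mathit{FileFormat} \times \mathit{Type} \times \mathit{PublicationDate} \times \mathit{Price}) \\
\mathit{FileFormat} &= (\mathit{all} + \mathit{postscript} + \mathit{pdf} + \mathit{html} + \ldots)^{*} \\
\mathit{Type} &= (\mathit{all} + \mathit{book} + \mathit{technicalreport} + \mathit{journalarticle} + \ldots)^{*}
\end{aligned} \tag{12}$$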

The value all is used to specify all file formats or all document types.

A system should eliminate unwanted documents by considering the information
about the source of a document. The document source category is intended
to provide conceptual information that specifies from where to gather
documents. We model DocSource as follows:

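$$\begin{aligned}
\mathit{DocSource} &= (\mathit{AllowSources} \times \mathit{DenySources}) \\
\mathit{AllowSources} &= (\mathit{Sources}) \\
\mathit{DenySources} &= (\mathit{Sources}) \\
\mathit{Sources} &= (\mathit{Collection}^{*} \times \mathit{Publisher}^{*} \times \mathit{Series}^{*} \times \mathit{Author}^{*})
\end{aligned} \tag{13}$$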

The all value is used to indicate all collections, all publishers, all series or all 
authors. The deny list contains sources that should not be considered, while the
allow list contains those that should be considered.
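
The structural and source constraints of a topic can be read as two filters applied to a candidate document. The sketch below is only one possible reading: the dictionary keys, the function names, the rule that a deny entry overrides the allow list, and the default of all for an unspecified dimension are assumptions of this example, and PublicationDate and Price are left out because their value domains are not specified above.

```python
SOURCE_FIELDS = ("collection", "publisher", "series", "author")


def matches_struct(doc, file_formats, types):
    """Check the FileFormat and Type parts of DocStruct for one document."""
    format_ok = "all" in file_formats or doc["format"] in file_formats
    type_ok = "all" in types or doc["type"] in types
    return format_ok and type_ok


def matches_source(doc, allow, deny):
    """Check DocSource: reject denied sources, then require allowed ones.

    allow and deny map each source dimension to a list of values; a missing
    dimension in allow is treated as ["all"] (assumption of this sketch).
    """
    for name in SOURCE_FIELDS:
        if doc[name] in deny.get(name, []):
            return False
    for name in SOURCE_FIELDS:
        allowed = allow.get(name, ["all"])
        if "all" not in allowed and doc[name] not in allowed:
            return False
    return True


# Hypothetical document record used to exercise the two filters.
doc = {
    "format": "pdf",
    "type": "technicalreport",
    "collection": "NCSTRL",
    "publisher": "ERCIM",
    "series": "TR",
    "author": "Smith",
}
wanted = matches_struct(doc, ["pdf", "postscript"], ["all"]) and matches_source(
    doc, {"collection": ["NCSTRL"]}, {"author": ["Jones"]}
)
```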

4 Architecture 

In this section we will show how an existing digital library may easily be
extended in order to provide a new service: automatically alerting a user when a
new document, matching the user’s profile, is available in the digital library.

There are basically two different possibilities to implement the above func- 
tionality which we call pull modality and push modality. In the former case the 
profile is used in order to generate a query based on it and submit the query to 
the native information retrieval engine of the digital library. In the latter case, 
any new incoming document is matched against all available profiles in order to 
select those for which the document is relevant. The two corresponding
architectures are depicted in Fig. 2 and Fig. 3.

The profile manager and the deliverer component are common to both
approaches. The profile manager mainly maintains the user profiles. It allows users
and authorized components to modify a user profile, and sends profiles or portions
of them to authorized components that request them. The deliverer component is
responsible for delivering according to the user’s delivery preferences.

In the pull modality, the scheduler component, at scheduled times depending
on the profile preferences, generates queries based on the content of the profiles and
submits them to the built-in information retrieval engine of the digital library.
From the result list we have to consider only those documents which have not
yet been delivered to the user. The resulting list is passed to the deliverer
component. It is worth noting that this solution may be applied to any existing
digital library, as ideally no modifications are needed to existing systems. The
pull module could be customized for different digital libraries simply by defining, for
the scheduler component, ad hoc wrappers that translate profiles into queries.
However, the pull modality allows only an approximate “as soon as
possible” functionality to be implemented. In fact, this functionality is implemented by periodically
asking the digital library whether there are new documents available (the
scheduler’s job). While this solution is impractical for rapidly changing information
sources, like news feeds, it seems to be quite feasible in the context of digital
libraries. For instance, in the case of NCSTRL, it is more than enough to query
the library once per day and deliver a message for each satisfied profile.
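
One scheduled run of the pull modality can be sketched as follows. This is a minimal illustration under stated assumptions, not the architecture's actual code: `pull_once`, `build_query`, `search` and the `delivered` bookkeeping are hypothetical names, with `build_query` standing for the library-specific wrapper and `search` for the library's native retrieval engine.

```python
def pull_once(profiles, build_query, search, delivered):
    """Run one scheduled pull and return {profile_id: [new document ids]}.

    profiles    : dict mapping profile id -> profile (any object)
    build_query : wrapper translating a profile into a query for the library
    search      : the library's own retrieval function, query -> list of ids
    delivered   : dict mapping profile id -> set of ids already delivered
    """
    to_deliver = {}
    for profile_id, profile in profiles.items():
        already = delivered.setdefault(profile_id, set())
        # Keep only documents not yet delivered to this user, in rank order.
        fresh = [doc for doc in search(build_query(profile)) if doc not in already]
        already.update(fresh)
        if fresh:
            to_deliver[profile_id] = fresh
    return to_deliver
```

Calling `pull_once` once per day from a scheduler would give the approximate "as soon as possible" behaviour described above; the `delivered` sets are what prevent the same document from being notified twice.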

In the push modality, as soon as a new document is available in the digital 
library, the document is sent to the filter component that matches it against all 
profiles. For all matching profiles, the deliverer component sends a notification. 
It is quite clear that the push schema allows “as soon as possible” notifications 
to be effectively handled. Another advantage is that, since only new documents 
are checked for profile matching, it is guaranteed, unlike the pull modality, that 
only new documents are delivered to the user. Unfortunately, in order to be 
implemented, some modifications to the existing digital library are needed. 
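
The push modality can be sketched in the same spirit. Again this is only an illustration: `push_document`, `matches` and `notify` are hypothetical names for the hook called by the library on each new document, the filter component and the deliverer component, and the keyword-overlap filter is a toy stand-in for real profile matching.

```python
def push_document(document, profiles, matches, notify):
    """Match one new document against every profile; notify on each match."""
    notified = []
    for profile_id, profile in profiles.items():
        if matches(document, profile):
            notify(profile_id, document)
            notified.append(profile_id)
    return notified


def keyword_match(document, profile):
    """Toy filter: a profile matches if it shares a keyword with the document."""
    return bool(set(document["keywords"]) & set(profile["keywords"]))
```

Because the function is invoked only for newly arrived documents, no record of previously delivered documents is needed, which is the advantage over the pull modality noted above.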

It is worth noting that a system which, among others, uses both of the 
solutions above, together with a quite similar profile schema as proposed in this 
paper, has been implemented within the Eurogatherer Project [7]. In short,
the Eurogatherer system is a personalised gathering and delivering system which
offers user profiling, gathering from heterogeneous information sources and all the
delivery modalities described in this paper. 

5 Conclusions 

There are three contributions in this paper. First, we have presented a quite 
abstract user profile model in the context of an information seeking scenario. We
discussed what information has to be represented in a user profile. Second, we
have described a user profile schema for digital library users. Third, we have provided
two simple architectural solutions in order to extend digital libraries to cope 
with user profiles. 

Currently, we are planning to provide a personalised search service, relying
on user profiles, within the ETRDL (ERCIM Technical Reference Digital
Library) [3], in which our institute is involved.

Abstract. The amount of textual information available has grown so much that it
is necessary to study new techniques that assist users in information access (IA).
In this paper, we propose using a user directed summarization system in an
IA setting to help users decide about document relevance. The
summaries are generated using a sentence extraction method that scores the
sentences by means of heuristics employed successfully in previous works
(keywords, title and location). User modeling is carried out by exploiting the user's
query to an IA system and expanding the query terms using WordNet. We present
an objective and systematic evaluation method designed to measure
summary effectiveness in two significant IA tasks: ad hoc retrieval and
relevance feedback. The results obtained confirm our initial hypothesis, i.e., user
adapted summaries are a useful tool for assisting users in an IA context.



1 Introduction 

The expansion of computer networks and mass data storage devices in recent years
has caused an increasing interest in information access (IA) systems. Nevertheless,
the textual information available has grown so much that it is essential to investigate
methods for minimizing overload effects and for improving access to the information
that is useful for a particular user. In this work, we study how user directed summaries
could contribute to improving the effectiveness of these systems.

Nowadays, IA systems include a great variety of capabilities. They offer information
searching and navigation methods on resources such as digital libraries, which often collect
multimedia material (documents, sounds, graphics, animations). They present new
human-computer information seeking dialogues and interfaces for organizing and
displaying retrieval results, encouraged by the continuous rise of the World Wide Web.
To these possibilities are added others such as metadata and
knowledge representation, category hierarchies or multilinguality.

Although, as we have said, digital libraries can store different kinds of objects, there is
no doubt that text is the major component. According to [22], text represents about
90% of all information handled by an organization. Text is found in
documents, manuals, reports, e-mail, web pages, faxes, and presentations. Thus, the
efforts of IA systems to place these strategic information sources within the user’s
reach, and to attempt to decrease the access costs, are understandable.

IA systems suggest, from a large document collection, those documents that they consider
“relevant” to a query made by a user in natural language. The query is a
usually concise formulation of a user’s specific information need. The output of these
systems generally shows a list of document titles, ranked by degree of similarity to
the query, together with the top lines of these documents. However, we think it is evident that
this information is, in most cases, insufficient for the user to assess relevance.
This forces the user to inspect the full text of the document, at an extra cost in time.
Even then, the structure and size of the document may not be
suitable, so that it can be difficult for the user to find the related information and,
finally, to decide about the document relevance. Therefore, it seems necessary to supply
these systems with tools that assist in deciding about the significance of a document
in relation to a certain information need.

An interesting approach is to replace, in the query results, the first sentences
of the document by a proper summary or extract. To date, many works [16, 7, 23,
13, 15, 27] describe systems that select sentences with relevant contents, using
statistical and linguistic techniques. However, these works take into account
neither the knowledge domain of the user nor his particular information needs.
Therefore, they generate summaries that can be called generic.

To propose a solution to these problems, our work focuses on building a system
that is able to generate text summaries directed to the user’s information requirements
[19]. User oriented summary generation is not a new idea. In [16], the
possibility of adapting summaries to particular interest areas or research fields is
already mentioned: Luhn proposes to score the sentences by assigning “a premium
value” to a predetermined class of words that identify these interest areas. Edmundson
[6] also argues that the process of summary construction should be “goal-oriented”,
that is, that the summary content should be explicitly defined
according to its use. Moreover, the authors of [24], employing users to evaluate the
summaries produced by their system, observe that users tend to select text sections
closely related to their own interests.

Measuring the quality of a summary is a hard problem. It is a
complex task to determine the properties of a good extract. In some works [7, 15, 27],
the evaluation involves comparing the automatic summary with a target abstract
generated by human judges. Nevertheless, it is known that there is no “single”
correct abstract for a document, so this does not seem to be a good evaluation method.
Another approach is introduced in [9], which proposes a task-oriented evaluation to
measure the utility of the summaries for categorization and ad hoc retrieval tasks.
This is the approach followed in SUMMAC to judge the participating summarization
systems [18]. Our evaluation proposal is also task-oriented, in an IA context.
However, we do not employ users as in other works [20, 17, 28], but a text retrieval
(TR) system and a TR test collection. We suggest evaluating the summaries in two of
the most significant tasks in an IA environment: ad hoc retrieval and relevance
feedback [26]. In both cases, the common underlying idea is to achieve an objective
and systematic evaluation method that allows us to use larger test collections than those
used in previous works and to compare the results of different automatic
summarization systems.

In conclusion, we think, firstly, that adapting summaries to users
can substantially improve effectiveness in IA and, secondly, that it is necessary to
evaluate the summaries in the context of the task for which they are generated.

The rest of this work is organized as follows. First, we introduce the sentence 
extraction techniques used to generate our summaries. Next, we present our 
evaluation environment and results for both tasks: ad hoc retrieval and feedback. 
Finally, we describe our conclusions and future work. This paper is closed with an 
appendix including a document and two summaries adapted to different queries. 



2 User Directed Summaries 

The summarization system that we introduce in this paper is based on a sentence
extraction technique. This method makes it possible to construct acceptable summaries that are
domain independent. The summaries are generated by selecting sentences of the original
document that contain information highly indicative of its content. The selection is
made by scoring each of the text sentences using a set of heuristics. The sentences
whose score exceeds some threshold, or the sentences with the highest scores, up to a
certain total, are chosen. The extract is composed of the document title and the
selected sentences in the same order in which they appear in the original text.

Among the most widely used heuristics for the construction of generic summaries, we
have chosen the keyword, title, and location methods. The summary adaptation is
obtained using two additional heuristics: processing the query to an IA system and
using the synonymy relation of the lexical database WordNet. In the following
subsections, we deal with these subjects.



2.1 Keyword Method 

The keyword method [16] involves looking for sentences that contain words with
high frequency, commonly called keywords, on the assumption that these words refer to central
topics of the document. To score the sentences, Luhn defines clusters as sets
of keywords that have no more than four non-keywords between them. In
this regard, [1] remarks that 98% of the lexical relations in English occur
between words within a window of five. However, other works like [15, 27] only
consider the occurrences of isolated keywords.

In our system, we use clusters of keywords for scoring the document sentences
because we aim to extract meaningful sentences built around significant concepts. Each of
our clusters has two or more keywords and a maximum of five other words
between them. For each document, the keywords are the ten content terms
(non-stop-words) with the highest tf-idf (term frequency, inverse document frequency) weight [27]. With
the tf-idf weight we try to reject words in common use in the collection
domain. To score a sentence, the cluster with the most keywords is selected. Following
Luhn’s method, the square of the number of keywords is divided by the total number of
words in the cluster. The final score for this method is obtained by multiplying the
previous value by the sum of the tf-idf weights of the cluster keywords.
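
The scoring just described can be sketched as follows. This is an illustrative reading rather than the system's implementation: the function name is hypothetical, the gap of five words is measured between consecutive keywords, the weights are summed over keyword occurrences, and ties between clusters are broken in favour of the first one; all of these are assumptions of the sketch.

```python
def keyword_score(tokens, weights, max_gap=5):
    """Score one sentence by its best keyword cluster.

    tokens  : the sentence as a list of (already normalised) words
    weights : dict mapping each document keyword to its tf-idf weight
    max_gap : maximum number of non-keywords between two cluster keywords
    """
    positions = [i for i, tok in enumerate(tokens) if tok in weights]
    if len(positions) < 2:
        return 0.0
    # Split the keyword positions into clusters wherever the gap is too wide.
    clusters, current = [], [positions[0]]
    for pos in positions[1:]:
        if pos - current[-1] - 1 <= max_gap:
            current.append(pos)
        else:
            clusters.append(current)
            current = [pos]
    clusters.append(current)
    # A cluster needs at least two keywords.
    clusters = [c for c in clusters if len(c) >= 2]
    if not clusters:
        return 0.0
    best = max(clusters, key=len)            # cluster with the most keywords
    span = best[-1] - best[0] + 1            # total words in the cluster
    weight_sum = sum(weights[tokens[i]] for i in best)
    return (len(best) ** 2 / span) * weight_sum
```

For example, with weights `{"patent": 2.0, "lawsuit": 1.0}`, the sentence `["a", "patent", "lawsuit", "b", "patent"]` contains one cluster of three keywords spanning four words, so its score is (9 / 4) * 5.0 = 11.25.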



2.2 Title Method 

This method supposes that words occurring in document titles, subtitles, and
headings could be good indicators of its content [7]. Moreover, we could think that,
when the text author divides the document body into sections, he makes, in some
way, a summary of them by choosing suitable titles.

To compute the score for a sentence, we must bear in mind that usually the
frequency of each term in the title is one and, in this case, its frequency in titles of
other documents is not interesting to us. Therefore, the value that we consider is the
square of the number of title words in the sentence, divided by the number
of terms (stop-words excluded) that form part of the title.
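
Written as a formula, with symbols introduced here only for illustration ($n_T(s)$ for the number of title words occurring in sentence $s$, and $|T|$ for the number of non-stop-word terms in the title), the quotient described above reads:

$$\mathrm{score}_{\mathit{title}}(s) = \frac{n_T(s)^{2}}{|T|}$$

For instance, for a title with three content terms, a sentence containing two of them would score $2^{2}/3 \approx 1.33$, while a sentence containing all three would score $3^{2}/3 = 3$.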



2.3 Location Method 

Edmundson [7] uses this method in his summarization system. He scores the
sentences occurring in paragraphs at the beginning or end of the document and,
especially, the first and last sentences of each of these paragraphs, or those below headings
like “Introduction”, “Purpose” or “Conclusion”.

To apply this method, we have studied the characteristics of the corpus used for
our tests. It is a collection of journalistic documents, and so the most important
information appears in the top lines. Thus, our system assigns a positive,
decreasing value to the first ten sentences of each document.
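
The text does not give the actual values assigned to the first ten sentences. Purely as an illustration, the sketch below assumes a linear decay from 1.0 for the first sentence to 0.1 for the tenth; both the function name and the decay law are assumptions, not the system's definition.

```python
def location_score(sentence_index, n_scored=10):
    """Return a positive, decreasing score for the first n_scored sentences.

    sentence_index is 0-based. The linear decay is an assumption made for
    this sketch; the text only says the value is positive and decreasing.
    """
    if 0 <= sentence_index < n_scored:
        return (n_scored - sentence_index) / n_scored
    return 0.0
```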



2.4 User Adaptation Methods 

The adaptation of summaries to the user’s information needs is one of the central issues of our
work. We think that the utility of a summary depends on the reader’s requirements,
and so we must not ignore them during the summarization process. For example, suppose
we have a document about the introduction of the Euro. A computing expert may
consider that a summary of this document is “bad” if it does not include some
references to the necessary changes in the software of companies and government
agencies. However, an economist will consider this same summary “good” if it only
mentions the financial aspects and the effects on the national economies. To solve this
problem, it is necessary to represent the user’s particular interests and to
consider them for the adaptation. In our example, the interests of the first user may be
Euro introduction, computing, software, changes in current systems, and their costs. On
the other hand, the second user may be interested in Euro introduction, the European
Central Bank and effects on inflation, unemployment, budget deficit, economic
growth, and stock markets.

In the next two sections, we propose a first approach to this complex task of
modeling the user’s needs. Considering the use of summaries in an IA setting,
we propose to use the user’s query and to expand it using the relations contained in a
lexical database.

An example of summaries generated by the system presented in this work is shown
in the final appendix. We have chosen a document and two user adapted summaries
extracted from it, each one employing a different query.



Query processing. In an IA setting, as in our case, the user’s query is a fundamental
element that can allow us to find out his information needs. Passage retrieval systems,
like the one described in [14], use the words of the user query to identify the
fragments closest to his information needs. In this way, when the document text is presented
to the user, its relevant passages are identified and highlighted, and therefore he can
observe their distribution inside the text. Consequently, it is easier for the user to make
the decision about the document relevance. This is demonstrated by the good results
obtained in the evaluation of the system described in [14].

In a similar way, automatic summarization can employ the words used by
the user. The purpose is to select sentences with high semantic content in relation to
his query and, consequently, to his information needs. This approach is followed in
[28], scoring the sentences depending on the number of words which are also in the
query.

Our system also employs the query as the basis for user modeling. That is, the
summaries include sentences containing query word clusters, like the clusters of Luhn
[16]. To score the sentences, similarly to the keyword method, it first finds the cluster
with the most query words. Then, the sentence score is computed as the quotient between
the square of the number of query words in this cluster and the total number of words in
the cluster.
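
As a formula, with notation introduced here only for illustration ($c^{*}$ for the cluster of sentence $s$ with the most query words, $q(c^{*})$ for the number of query words it contains, and $|c^{*}|$ for its total number of words), this score can be written as:

$$\mathrm{score}_{\mathit{query}}(s) = \frac{q(c^{*})^{2}}{|c^{*}|}$$

For instance, a cluster of four words containing three query words yields $3^{2}/4 = 2.25$. Unlike the keyword method, no tf-idf weight enters this score.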



WordNet synonyms. A procedure to increase the information provided by a query 
could be to use the relations stored in lexical databases. The goal is to expand the 
query with other words that could be used to express the same concepts. Thus, we 
expect to increase the capacity to select sentences with high interest for the user. 

The lexical database WordNet [21] has already been used successfully in tasks like TR
[29] or text categorization (TC) [4]. In [29] all the lexical and conceptual relations
included in WordNet are used to expand the query words. The evaluation shows that
the use of WordNet enhances the effectiveness of TR systems when query
statements are scarcely detailed. Buenaga et al. [4] also use synonym information
from WordNet, in combination with a training collection, to improve the performance
of TC systems. The underlying idea is to use the synonyms to increase the amount
of information on all categories, and especially on those with few training examples. The
evaluation shows that this combined approach performs much better than those based
only on training. In both works, the expansion is performed manually, picking, in a
disambiguation process, the correct sense of the word to expand.

Since WordNet represents concepts as synonym sets, or synsets, our system exploits
this information to find synonyms for the words in the user’s queries. All synsets
for all possible meanings and parts of speech of each of these words are selected,
and every term belonging to them is added to the query term set. If a collocation
appears as a synonym, it is broken into its component words, stop-words are removed,
and the remaining words are stemmed. Finally, the score for a sentence is computed
in the same way as in the previous method, but using the expanded query term set.
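
The expansion steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the small dictionary stands in for WordNet (its grouping of words into senses is made up for the example), the stop-word list is minimal, and `toy_stem` is a crude placeholder for a real stemmer.

```python
STOP_WORDS = {"a", "an", "the", "of", "in"}

# Hypothetical stand-in for WordNet: word -> list of synsets, covering
# all senses and parts of speech. The grouping is illustrative only.
TOY_SYNSETS = {
    "patent": [["patent", "letters patent"],
               ["apparent", "evident", "manifest", "patent", "plain"]],
    "infringement": [["infringement", "violation"],
                     ["misdemeanor", "misdemeanour", "infraction",
                      "infringement", "offence", "offense"]],
    "lawsuits": [["lawsuit", "suit", "case", "cause", "causa"]],
}


def toy_stem(word):
    """Crude plural stripper standing in for a real stemmer (assumption)."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def expand_query(query_words, synsets=TOY_SYNSETS):
    """Full expansion: add every term of every synset of every query word."""
    expanded = set()
    for word in query_words:
        terms = [word]
        for synset in synsets.get(word, []):
            terms.extend(synset)
        for term in terms:
            # A collocation such as "letters patent" is split into words.
            for part in term.split():
                if part not in STOP_WORDS:
                    expanded.add(toy_stem(part))
    return expanded


expanded_terms = expand_query(["patent", "infringement", "lawsuits"])
```

Because no sense is selected, the adjective senses of "patent" (evident, manifest, plain) enter the query term set alongside the intended noun sense.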



3 Ad hoc Retrieval Task 

Search services are an essential part of digital libraries because they are the
primary way in which users access information. Ad hoc retrieval is usually among
the services offered. For example, the ACM Digital Library allows its users to
retrieve documents relevant to a query. Besides the document title, the Digital
Library presents additional information such as the bibliographic reference, index
terms, and the abstract. Users can draw on this information when deciding whether
to purchase the article. It therefore seems important to determine summary quality
in a setting like the one we describe.

<dom> Domain: International Economics 
<title> Topic: Patent Infringement Lawsuits 
<desc> Description: 

Document will discuss a current patent infringement
lawsuit.

<narr> Narrative: 

A relevant document will discuss a current, settled 
or pending, patent infringement lawsuit. It will 
identify the plaintiff, the defendant, and a specific 
(reference to a product or technology that infringes 
or allegedly infringes a patent) or general 
(reference to an industry) complaint. 

<con> Concept (s): 

1. patent infringement suit, patent-infringement 
lawsuit, patent dispute, patent action, patent case 

2. sue, file a complaint 

3. plaintiff, defendant 

4. issue a preliminary injunction, enjoin 

4. win, lose, settle, issue a permanent injunction, 
reinstate an injunction 

5. damages, damage claim 

Fig. 1. TREC topic number 20 

Thus, our experiments attempt to measure how well users are assisted in deciding on
the relevance of a document retrieved by an IA system using just the summary.
Moreover, we want to demonstrate the effectiveness of techniques that adapt
summaries to the user’s needs [19]. Next, we describe the evaluation environment of
the experiments carried out in the ad hoc retrieval task and the results achieved.



3.1 Evaluation Environment

To obtain an objective and systematic evaluation, we need a test collection and a
TR system equipped with a performance evaluation module. A test collection is made
up of a corpus of documents, a set of queries, and the corresponding relevance
assessments of each document for each query. These assessments are precisely what
makes it possible to measure the effectiveness of a TR system.

Table 1. Characteristics of the test collection used in the ad hoc retrieval evaluation

Number of documents             5000
Size (Mb)                       16
Number of queries               50
Number of relevant documents    385
11-pt avg.                      0.2273



For our experiments we used the TREC (Text REtrieval Conference) test collection
[11], one of the most widely used in TR. In particular, we selected at random 5000
documents from the Wall Street Journal (WSJ) corpus, a set of journalistic texts
belonging to the Tipster Information-Retrieval Text Research Collection, Volume 2.
Also at random, we chose 50 TREC queries (called topics) from topics 1-100 that
have at least one relevant document within the selected corpus. Table 1 shows the
characteristics of this test collection.



Environment
Finance
International Economics
International Finance
International Relations
Law and Government
Military
Science and Technology
U.S. Economics
U.S. Politics

Fig. 2. Domains of topics selected for ad hoc retrieval evaluation



The topics are composed of domain, title, description, narrative, and concepts, as
shown in figure 1. The domain gives the topic scope. Figure 2 lists the domains of
the topics employed in our experiments. The title corresponds to a query that a
user of an IA system might issue. The description and narrative provide a more
detailed description of what constitutes a relevant document. The concept section
contains a list of words and phrases related to the topic. From all this
information, we use only the title field because we consider it the only one
representative of a query from an IA system user. Table 2 shows the number of
content terms (stop-words excluded) in the corpus and in the TREC topic titles.

Having chosen the corpus for our experiments, we only need a TR system to carry
out the evaluations. We selected Smart [3] because it is based on the vector space
model, it includes automatic indexing, retrieval, and evaluation, and it is
publicly available (it can be obtained from ftp://ftp.cs.cornell.edu/pub/smart).

Table 2. Number of content terms in the corpus and query sets used in the ad hoc retrieval evaluation

                                     Expanded queries
Num. terms    Corpus    Queries    Fully    Manually
Maximum       6123      6          110      17
Minimum       9         1          1        1
Average       383.74    2.88       26.96    5.16



The experiments consisted of evaluating collections of original texts, initial
segments, generic summaries, query-adapted summaries, and expanded-query-adapted
summaries. The evaluation of the corpus obtained by extracting the initial
sentences of the original text is the baseline of our experimentation. We can thus
compare summary effectiveness against the output of a conventional IA system, i.e.
the first lines of the document.

To obtain generic summaries we used the keyword, title, and location methods. The
final score for a sentence is computed by normalizing each method score into the
interval [0, 1], multiplying it by a weight, and adding the results. In all the
experiments, these weights took the values 1.5, 1, and 0.5, respectively.

Next, we created 50 collections of 5000 documents each to evaluate the
query-adapted summaries. Every collection contained summaries of the entire original
corpus directed to the same query. The preceding weights were used, together with a
query factor of 3. Each collection was evaluated separately using the query
employed to generate it. The final results for these kinds of summaries were
obtained by averaging the partial results of these 50 collections.
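
The weighted combination described above can be sketched as follows. The weights (1.5, 1, and 0.5 for the keyword, title, and location methods, and 3 for the query method) come from the text; the min-max normalization is an assumption, since the text only says that scores are normalized into [0, 1], and the function names are illustrative.

```python
GENERIC_WEIGHTS = {"keyword": 1.5, "title": 1.0, "location": 0.5}
QUERY_ADAPTED_WEIGHTS = dict(GENERIC_WEIGHTS, query=3.0)


def normalize(scores):
    """Min-max normalize raw scores into [0, 1] (assumed scheme)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def combine(method_scores, weights):
    """Weighted sum of normalized method scores, one total per sentence.

    method_scores maps a method name to a list of raw scores, one per
    sentence of the document; every list must have the same length.
    """
    n = len(next(iter(method_scores.values())))
    totals = [0.0] * n
    for method, raw in method_scores.items():
        for i, s in enumerate(normalize(raw)):
            totals[i] += weights[method] * s
    return totals


# Hypothetical raw scores for a three-sentence document.
raw = {"keyword": [2, 4, 0], "title": [1, 0, 0], "location": [3, 2, 1]}
generic = combine(raw, GENERIC_WEIGHTS)

raw_with_query = dict(raw, query=[0, 5, 5])
adapted = combine(raw_with_query, QUERY_ADAPTED_WEIGHTS)
```

In this toy case the first sentence ranks highest in the generic summary, but the weight of 3 on the query method moves the second sentence to the top of the query-adapted one.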

Table 3. Full and manual expansion using WordNet for topic number 20. Synonyms of each of
the title words are shown on separate lines

Fully expanded       patent (letters patent) apparent evident manifest plain
                     infringement violation misdemeanor misdemeanour
                     infraction offence offense
                     lawsuits suit case cause causa
Manually expanded    patent (letters patent)
                     infringement violation
                     lawsuits suit case cause causa



For the experiments on query expansion using WordNet, we carried out two variants:

1. For each content term in the topic title, the system takes every word belonging
to the synsets, for all possible meanings and parts of speech, that contain that
title term.

2. The terms in the topic title are manually grouped, where possible, and expanded
choosing only one part of speech and one meaning.

Table 2 shows the number of content terms after applying these query expansion
processes. Table 3 shows an example of the two kinds of query expansion for the
topic of figure 1.

[Figure 3: plot with one curve per collection: Original collection, Initial
segment, Generic, Query adapted, Full expansion, and Manual expansion.]

Fig. 3. Effectiveness of 15% length summaries

Finally, we fixed the summary lengths, i.e., the number of sentences that the
system has to select from among those with the highest scores. We experimented with
summaries of different lengths: 10, 15, and 30%.
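
A minimal sketch of this selection step is given below. Rounding the number of sentences up, keeping at least one sentence, and presenting the selected sentences in document order are assumptions for the example, not details stated in the text.

```python
import math


def select_summary(sentences, scores, percent):
    """Keep the top-scoring `percent` of sentences, in original order.

    Assumptions: the sentence count is rounded up and never below one.
    """
    k = max(1, math.ceil(percent * len(sentences) / 100))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    return [sentences[i] for i in keep]


# Hypothetical ten-sentence document with made-up scores.
sentences = ["s%d" % i for i in range(10)]
scores = [1, 9, 3, 8, 2, 7, 0, 5, 4, 6]
summary_30 = select_summary(sentences, scores, 30)
summary_10 = select_summary(sentences, scores, 10)
```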



3.2 Results and Interpretation 

The results presented in this section were obtained using each of the previously
mentioned collections, together with the set of topic titles and the relevance
assessments, as a test collection. We then indexed and evaluated them using the
Smart TR system.

The effectiveness measures utilized in Smart are recall and precision [26]. Recall is 
the proportion of relevant documents in the corpus that are retrieved. Precision is the 
proportion of retrieved documents that are relevant. Usually, the average precision 
over all queries, at each of the eleven recall points (from 0.0 to 1.0 at intervals of 0.1), 
is used to compare different systems. 

The 11-point average precision (i.e. the mean of the average precision values at
the 11 recall levels) obtained for the original corpus was 0.2273, as shown in
table 1. Table 4 presents the results for the other collections and summary lengths
that we tested. The percentage change with respect to the document initial segment
sets is also shown.
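
For readers unfamiliar with the measure, the following sketch computes an 11-point average precision over a set of queries. It uses interpolated precision (the highest precision at any recall greater than or equal to each level), which is a common convention but an assumption here; the exact procedure used by Smart may differ in detail, and all names are illustrative.

```python
RECALL_LEVELS = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0


def precision_at_recall_levels(ranking, relevant):
    """Interpolated precision at the 11 recall levels for one query."""
    relevant = set(relevant)
    points = []  # (recall, precision) after each relevant document found
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in RECALL_LEVELS]


def eleven_point_average(rankings, assessments):
    """Average precision over queries at each level, then mean over levels."""
    per_query = [precision_at_recall_levels(rankings[q], assessments[q])
                 for q in assessments]
    per_level = [sum(col) / len(per_query) for col in zip(*per_query)]
    return sum(per_level) / len(per_level)


# Hypothetical rankings and relevance assessments for two queries.
rankings = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
assessments = {"q1": {"d1", "d3"}, "q2": {"d6"}}
value = eleven_point_average(rankings, assessments)
```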

The best results are achieved by the user-adapted summaries, whereas generic
summaries and initial segments obtain similar values. With respect to summary
length, the highest results are obtained for 15%. Figure 3 compares the
effectiveness of the different summaries of this length.

The lack of difference between generic summaries and initial segments is probably
due to the characteristics of the corpus used in our tests. Journalistic documents
tend to concentrate the valuable information at their beginning. We therefore think
that with other kinds of documents the results obtained for initial segments would
be worse.



Table 4. Results obtained in the ad hoc retrieval evaluation

                                    Summary collections
Length                Initial    Generic    Query      Full         Manual
                      segment               adapted    expansion    expansion
10%    11-pt avg.     0.1725     0.1691     0.2258     0.2234       0.2242
       % change                  -1.9       30.9       29.5         29.9
15%    11-pt avg.     0.1733     0.1628     0.2435     0.2368       0.2463
       % change                  -6.1       40.5       36.7         42.1
30%    11-pt avg.     0.1860     0.1902     0.2436     0.2367       0.2402
       % change                  2.3        31.0       27.2         29.2



Full expansion of queries does not improve on the results of query-adapted
summaries. However, when the word senses are manually disambiguated, precision
increases slightly. Thus, the best result is obtained using summaries adapted to
manually expanded queries with 15% length. This value even improves on the one
achieved for the original collection by 8.4%.
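
The percentage figures can be checked with a short calculation. The sketch below recomputes the change of the best summary result (0.2463) with respect to the original collection (0.2273) and with respect to the 15% initial segment baseline (0.1733); the function name is illustrative.

```python
def percent_change(value, baseline):
    """Relative change of value with respect to baseline, in percent."""
    return round(100.0 * (value - baseline) / baseline, 1)


# Manual expansion, 15% length, versus the original collection (table 1).
vs_original = percent_change(0.2463, 0.2273)

# The same result versus the 15% initial segment baseline (table 4).
vs_initial_segment = percent_change(0.2463, 0.1733)
```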

In conclusion, the synset disambiguation process seems essential, as it is in [29]
when expanding a query to improve TR effectiveness. Full expansion adds to the
expanded query set many words whose meaning is far from the sense in which the user
employs them. Table 2 shows that the average query set size goes from about 3 terms
in the original queries to about 27 in fully expanded queries. Moreover, the small
improvement achieved using manual expansion is obtained by adding very limited
information to the query (from about 3 words per query to about 5). Better results
may therefore be expected from integrating more WordNet information (such as
hyperonymy and meronymy relations).



4 Relevance Feedback Task 

Among the advanced search services that a digital library may offer, relevance
feedback is one of the most widely used because it improves the effectiveness of TR
systems. Continuing with the example of the ACM Digital Library, it supports a kind
of feedback by allowing users to search for articles related to one previously
retrieved.

In general, after performing a search and retrieving a set of documents, the user
provides the TR system with feedback, indicating whether some of the retrieved
documents are relevant or not. In the vector space model, the system may use this
information to improve the query in two ways: reweighting the query terms and adding
new terms to the query [26].

Different experiments, using several relevance feedback methods and test
collections, confirm that this technique improves retrieval performance. However,
most of them used collections of short texts (document abstracts) [25, 10] (such as
CACM, CISI, Cranfield, and others). The evaluations on full text collections (WEST)
show a smaller increase in effectiveness than in the cases where short text
collections are used [8]. That work suggests that the cause may be the poor
performance of relevance feedback methods when full texts are used, i.e. the term
selection and reweighting may not be suitable. Along these lines, Allan [2] examines
the value of using passages of long documents for feedback instead of the full
text.

The purpose of this second evaluation is to check the quality of different
summaries by measuring the effectiveness improvements achieved in feedback when the
full text is replaced by summaries. We focused our work on long documents because
this kind of text may make relevance feedback techniques more difficult to apply.
If summaries collect only significant information, they may avoid the selection of
unsuitable terms. So, a solution for large documents in relevance feedback may be
to use summaries instead of the full text. In the following sections, we first
explain the evaluation environment for the experiments on the relevance feedback
task and then present the results obtained.

Environment
International Economics
International Finance
International Relations
Law and Government
Political
Science and Technology
U.S. Economics
U.S. Politics

Fig. 4. Domains of topics selected for RF evaluation



4.1 Evaluation Environment 

For our experiments we selected all the documents from the WSJ corpus in Tipster
Volume 2 whose length was greater than or equal to 1250 content words (stop-words
excluded). Then, 50 TREC topics were chosen at random from the interval 1-100, each
with at least one relevant document within the selected subcorpus. Tables 5 and 6
show the characteristics of this test collection. Figure 4 lists the domains of the
topics employed.



Table 5. Characteristics of the test collection used in the RF evaluation

Number of documents             2379
Size (Mb)                       34
Number of queries               50
Number of relevant documents    181
11-pt avg. - Initial run        0.0589
11-pt avg. - Feedback run       0.1105

Salton and Buckley’s experiments [25] showed that the Ide dec-hi relevance feedback
method [12] obtains the best overall results for the vector space model. It derives
the new query vector from the initial query, all the documents deemed relevant, and
only the top-ranked document among those identified as nonrelevant. This can be
formulated as in (1), where $Q_n$ and $Q_o$ represent the new and original query
vectors respectively, $R_i$ is the vector for relevant document $i$, and $S$ is the
vector for the top nonrelevant document.

$$Q_n = Q_o + \sum_{\text{all relevant}} R_i - S \qquad (1)$$

For our experiments, we assumed that user judgements are made over the 15
top-ranked documents retrieved using the initial query. Document relevance is taken
from the relevance assessment list of the test collection. These documents are then
used to obtain the feedback query using formula (1).
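
A minimal sketch of this feedback step is shown below, with vectors represented as dictionaries from terms to weights. It follows formula (1) as described in the text: the vectors of all relevant documents among the top 15 are added to the query, and the vector of the highest-ranked nonrelevant one is subtracted. The data layout and names are assumptions, and negative weights are simply kept, since the text does not say how they are handled.

```python
from collections import defaultdict


def ide_dec_hi(query, ranked_docs, relevant_ids, top_n=15):
    """Build the feedback query from the top_n judged documents.

    query: dict term -> weight
    ranked_docs: list of (doc_id, vector) in rank order, vector a dict
    relevant_ids: set of document ids judged relevant
    """
    new_query = defaultdict(float, query)
    top_nonrelevant = None
    for doc_id, vector in ranked_docs[:top_n]:
        if doc_id in relevant_ids:
            for term, weight in vector.items():
                new_query[term] += weight
        elif top_nonrelevant is None:
            top_nonrelevant = vector  # highest-ranked nonrelevant document
    if top_nonrelevant is not None:
        for term, weight in top_nonrelevant.items():
            new_query[term] -= weight
    return dict(new_query)


# Hypothetical initial query and retrieved documents.
query = {"patent": 1.0}
ranked = [
    ("d1", {"patent": 2.0, "suit": 1.0}),
    ("d2", {"drug": 3.0}),
    ("d3", {"patent": 1.0, "court": 1.0}),
    ("d4", {"drug": 1.0}),
]
feedback_query = ide_dec_hi(query, ranked, relevant_ids={"d1", "d3"})
```

When summaries replace the full text for feedback, only the document vectors change: they are built from the summary of each judged document rather than from the whole document.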



Table 6. Number of content terms in the corpus and query sets used in the RF evaluation

Num. terms    Corpus     Queries    Fully expanded queries
Maximum       8396       7          88
Minimum       1250       1          1
Average       1655.78    3.44       23.66



Effectiveness is evaluated using the residual collection method [5]. The residual
collection is constructed from the original collection by removing the documents
shown to the user (the top 15 in our experiments). It is thus possible to compare
the results of running the initial and feedback queries over the residual
collection and so measure the improvement achieved with feedback. This evaluation
method avoids distorting the results with the foreseeable improvement that comes
merely from reranking documents already seen by the user. Table 5 shows the 11-point
average precision obtained for the test collection in the initial and feedback runs
over the residual collection. The low values are a consequence of the evaluation
method, which excludes highly ranked relevant documents from the residual
collection.
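
The residual collection method can be illustrated with a short sketch: documents already shown to the user are removed both from each ranking and from the set of relevant documents before any effectiveness measure is computed. The function names and the toy data are assumptions.

```python
def residual_ranking(ranking, seen):
    """Ranking restricted to documents the user has not yet seen."""
    seen = set(seen)
    return [doc for doc in ranking if doc not in seen]


def residual_relevant(relevant, seen):
    """Relevant documents that remain in the residual collection."""
    return set(relevant) - set(seen)


# Hypothetical collection of 20 documents; the top 15 are shown to the user.
initial_ranking = ["d%d" % i for i in range(20)]
seen = initial_ranking[:15]

feedback_ranking = ["d17", "d3", "d16", "d0", "d19", "d18", "d15"]
relevant = {"d3", "d16"}

initial_residual = residual_ranking(initial_ranking, seen)
feedback_residual = residual_ranking(feedback_ranking, seen)
remaining_relevant = residual_relevant(relevant, seen)
```

Both residual rankings are then scored against the remaining relevant documents only, so a relevant document such as d3, which the user has already judged, no longer contributes to the feedback run.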

The experiments for this task consisted of the evaluation of generic summaries,
query-adapted summaries, and fully-expanded-query-adapted summaries. These summary
collections are obtained in the same way as explained in section 3.1. Moreover, we
wanted to compare their effectiveness with that of summaries generated using only
the query processing method. In this way, we attempt to construct summaries that
contain just the sentences relevant to the user query, like pseudo-passages.

The summary lengths we decided to use for the relevance feedback task are 5, 10,
and 15% of the number of sentences in the original document.



4.2 Results and Interpretation

The results presented in this section were obtained by running the initial search
over the original collection but using the different summary collections for
feedback instead of the full text. The queries used were the titles of the TREC
topics chosen as described in the previous section. We excluded those topics that
had all their relevant documents within the 15 top-ranked documents, i.e. those
without relevant documents in the residual collection.

The utility of each kind of summary can be evaluated by measuring the 11-point
average precision for the feedback run and comparing it with the baseline result
achieved using the full text. Table 7 shows, for each kind of summary and length,
the average precision and the percentage increase or decrease with respect to the
baseline (shown in table 5).



Table 7. Results obtained in the RF evaluation

                               Summary collections
Length                Generic    Query      Full         Only
                                 adapted    expansion    query
5%     11-pt avg.     0.1148     0.1124     0.1123       0.1025
       % change       3.9        1.7        1.6          -7.2
10%    11-pt avg.     0.1146     0.1090     0.1090       0.0971
       % change       3.7        -1.4       -1.4         -12.1
15%    11-pt avg.     0.1097     0.1098     0.1100       0.1082
       % change       -0.7       -0.6       -0.5         -2.1



The best results are achieved by generic summaries at 5% length. Query-adapted and
full expansion summaries come next, with similar values. The worst results are
obtained by the summaries that use only the query processing method.

It is thus evident that the information collected in generic summaries is more
useful for the relevance feedback task than that selected for user-adapted
summaries. Moreover, although generic summaries manage to improve on the results of
the full-text feedback run, this gain is not as large as we initially expected. We
think that both effects may be due to our summaries collecting redundant
information. Among the sentences scored highly by the summarisation system there
may be some with similar content. Furthermore, the probability that the documents
to be summarised contain redundant information is greater when they are large. So,
a possible way to improve the system would be to prevent redundant sentences from
being included in the summary.



5 Conclusions and Future Work 

In this paper, we have presented a summarization system characterized by the
incorporation of a user model to generate summaries adapted to the user’s
information needs. User modeling is carried out by exploiting the user’s query to an
IA system and expanding the terms in the query through the synonymy relation stored
in WordNet. The purpose is to supply the user of an IA system with a summary that
helps in deciding on document relevance without inspecting the full document.

We have also presented a systematic evaluation method that has allowed us to
compare the effectiveness of different kinds of summaries on corpora larger than
those used in previous works. This evaluation has been carried out in an IA setting
and has been oriented to the text retrieval and relevance feedback tasks. Ad hoc
retrieval results have shown the superiority of user-adapted summaries, which even
improve on the performance obtained using the original text. Moreover, relevance
feedback experiments have partially supported the hypothesis that summaries can
improve on full text effectiveness for large documents. We can thus conclude that
user-adapted summaries are a useful tool for assisting users in an IA context.

In future work, we will try to achieve a better user model by using more WordNet
relations (such as hyperonymy and meronymy) and by incorporating information from
the user’s feedback. We are also interested in carrying out more experiments on
corpora with documents of different sizes.



Appendix

Figure 5 shows a document belonging to the collection used in the ad hoc retrieval
experiments. Figures 6 and 7 are examples of user-adapted summaries of 10% length.
In both cases, the adaptation is achieved using the words in the title sections of
two TREC topics and expanding them using WordNet (full expansion). These query
words or their synonyms appear in bold face. The sentences specific to each summary
are shown underlined.

(...) 

<HL> 

Medicine : 

Wellcome Halts 

Six-Year Effort 

On Heart Drug 

By Joann S. Lublin 

Staff Reporter of The Wall Street Journal 
</HL> 

(...)

<LP> 

LONDON -- Wellcome PLC halted work on TPA after six years of 
costly research, a setback that may spur sales for Genentech Inc.'s 
flagship version of the clot-dissolving drug. 

At the same time, the major British pharmaceutical maker said it 
will accelerate development of Wellferon, its form of the drug 
interferon, by expanding production and seeking wider approval for 
the medicine's use in treating hepatitis. Some analysts praised
Wellcome's strategic shift, noting that interferon has greater
long-term sales potential than TPA might have had in a market 
already clogged with competitors. 

</LP> 

<TEXT> 

Wellcome said it decided to discontinue development of TPA after 
losing a U.S. patent infringement suit brought by Genentech, a 
South San Francisco company that has agreed to a merger plan that 
would give a majority stake to Roche Holding Ltd. of Switzerland. 

Last month, a federal district court jury in Delaware found that 
Wellcome and Genetics Institute Inc., its U.S. collaborator, had 
infringed on three Genentech patents by trying to develop forms of 
TPA. Wellcome previously had won a U.K. patent case involving the 
drug, used during heart attacks to restore blood flow and thus 
limit damage. 

(...) 

Wellcome's abandonment of TPA also may convince other drug
companies also to relinquish development of their versions, Mr.
Refsum said. "It will restrict the development of further TPAs but
enhance the development of other bio-technology drugs" because
Genentech's court victory demonstrated these medicines' patents may
withstand legal challenges. In South San Francisco, the attorney
who defended Genentech's TPA patent position said that Wellcome's
withdrawal from the field prolongs the biotechnology firm's
effective monopoly and discourages future competition.

(...)

Mr. Refsum and other analysts said Wellcome was smart to switch
research emphasis on genetically engineered drugs from TPA to
Wellferon, which mainly has been marketed for minor indications
such as hairy cell leukemia. The interferon appears useful in
treating hepatitis B, the viral infection's most serious form. It
is "one of their major products in R&D," said Andrew Porter, a
London analyst for Nikko Securities Co. But unlike Wellcome's
leading AIDS drug, AZT, he said, "Wellferon will be a useful but
not spectacular contributor to Wellcome."

Wellcome's Mr. Sherwood said Italy recently cleared Wellferon
for use in treating hepatitis B and "initial indications are that
sales are going to be good there." Within a few months, Wellcome
expects to gain marketing clearance in Spain and Southeast Asia. He
said the company views those areas "as large potential markets"
because hepatitis is so widespread there.

</TEXT> 



Fig. 5. Document number WSJ900511-0120

LONDON -- Wellcome PLC halted work on TPA after six years of costly 
research, a setback that may spur sales for Genentech Inc.'s 
flagship version of the clot-dissolving drug. 

At the same time, the major British pharmaceutical maker said it
will accelerate development of Wellferon, its form of the drug
interferon, by expanding production and seeking wider approval for
the medicine's use in treating hepatitis.

Wellcome's abandonment of TPA also may convince other drug
companies also to relinquish development of their versions, Mr.
Refsum said.

Mr. Refsum and other analysts said Wellcome was smart to switch 
research emphasis on genetically engineered drugs from TPA to 
Wellferon, which mainly has been marketed for minor indications 
such as hairy cell leukemia. 

Fig. 6. Summary of the document shown in figure 5 adapted to TREC topic number 14. The
title of this topic is “Drug Approval”

LONDON -- Wellcome PLC halted work on TPA after six years of costly 
research, a setback that may spur sales for Genentech Inc.'s 
flagship version of the clot-dissolving drug. 

Wellcome said it decided to discontinue development of TPA after 
losing a U.S. patent infringement suit brought by Genentech, a 
South San Francisco company that has agreed to a merger plan that 
would give a majority stake to Roche Holding Ltd. of Switzerland.
Wellcome previously had won a U.K. patent case involving the drug,
used during heart attacks to restore blood flow and thus limit
damage.

Mr. Refsum and other analysts said Wellcome was smart to switch 
research emphasis on genetically engineered drugs from TPA to 
Wellferon, which mainly has been marketed for minor indications 
such as hairy cell leukemia. 



Fig. 7. Summary of the document shown in figure 5 adapted to TREC topic number 20
(figure 1). The title of this topic is “Patent Infringement Lawsuits”



Abstract. Finding relevant information is one of the biggest problems
that Web users experience. This article describes Pharos, a new service
that has been developed to help groups of Web users share their
knowledge about interesting documents. Pharos relies on a collaborative
infrastructure which allows user groups to index and evaluate documents
on specific topics. This information, possibly subjective, is synthesized
to produce personalized recommendations. Scalability is handled by
distributing servers and replicating their databases. Pharos has been
implemented in Java and is currently being evaluated.



1 Introduction 

The World-Wide Web is the easiest way to disseminate information to the global
community. In December 1997, the estimated size of the indexable Web was 320
million pages [15]. Nevertheless, according to a recent survey [12], about half
of the users consider that it is a big problem to find the information they are
looking for. The Web infrastructure relies on hypertext and does not provide a
mechanism to quickly find a valuable document on a precise subject. As a result,
people search for information in dedicated sites that index the Web, or consult
alternative communication channels such as news groups or mailing lists.

In January 1999, eight out of the ten most visited sites were search engines
[19]. Search engines index the content of the Web and provide a query
interface on a Web site. Their robots travel through hyperlinks to discover new
documents. Each document found is then indexed. Automatic indexing uses full-text
algorithms to associate meaningful words in the page to its URL. Therefore,
most of the search engines can be queried on words only and not on the concepts
of the pages they index¹. So, query formulation is a difficult exercise if both
noise and silence are to be avoided. Search engines using manual cataloging, such
as Yahoo!², return more relevant responses, but the user must trust the expertise
of the people who select and organize information. The success of search engines is
due to two main reasons: (i) they are the easiest place to find information; (ii)
they have a relatively good coverage of the Web [15,26]. However, one major
drawback of search engines is that they only give links. Users must read the
proposed documents and determine for themselves whether they are interesting. This
work is time consuming and must be done by all the users for a given search.
Factorizing this repetitive effort would help people find the most relevant
documents more quickly.

¹ Some search engines (e.g. www.excite.com, www.infoseek.com or
www.altavista.com) extend requests to morphologic or lexicographic variations.

Search engines do not provide evaluations about returned documents. People
therefore use alternative systems such as news groups and mailing lists. They
ask for information, and someone knowing relevant references can send their advice
back to the group. Such systems are highly dependent on the people who
belong to the groups. The more expert the group members are, the more valuable
the information is. Frequently asked questions, off-topic questions or bad
responses generate noise that depreciates the group. The lack of structure in
the messages’ content makes automatic filtering difficult. Furthermore, this
requires experience, and new members cannot immediately estimate the expertise
of existing members. Because messages are not persistent, the benefit of one’s
accumulated knowledge cannot be given to new or external members. To address
this last point, alternatives are to build FAQs³ [11], archives, links pages
or thematic portals. But these solutions leave unresolved the problem of
locating them and are time consuming to maintain. There is clearly a need to
provide a system that integrates functionalities similar to search engines but which
references valuable information such as that exchanged within groups of experts.

Pharos is a collaborative infrastructure which allows people to share their
knowledge on a specific topic. People finding a valuable document put an annotation
in the appropriate channel. An annotation is a structured datum referencing
a URL. It is composed of values such as a title, a rating (a subjective note),
a free comment and a list of keywords. A channel is a database of annotations
dedicated to a specific topic (e.g. games, music, Java language, Web technologies,
and so on). People interact with Pharos by using a browser assistant. The
browser assistant is a personal proxy with a GUI⁴ (see Figure 1). It observes
the URLs the user accesses, requests channel servers for annotations associated
with each URL and displays them. As a result, the user can quickly evaluate the
interest of the document. Pharos also helps to find rated information. Queries
can be performed to extract annotated documents matching some criteria (e.g.
all the documents rated as “good” by “bob” and categorized with the keywords
“documentation:book” and “api:servlet”).
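
To make the notions of annotation and channel more concrete, the following Python sketch models them as simple in-memory structures. It is only an illustration under stated assumptions: Pharos itself is implemented in Java, and the class names, field names, and the `search` signature used here are hypothetical, chosen to mirror the values listed above (title, rating, comment, keywords) and the example query.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A structured datum referencing a URL (field names are assumptions)."""
    url: str
    author: str
    title: str
    rating: str                      # a subjective note, e.g. "good"
    comment: str = ""
    keywords: list = field(default_factory=list)


class Channel:
    """An in-memory stand-in for a channel: annotations on one topic."""

    def __init__(self, topic):
        self.topic = topic
        self.annotations = []

    def publish(self, annotation):
        self.annotations.append(annotation)

    def for_url(self, url):
        """Annotations to display when the user visits this URL."""
        return [a for a in self.annotations if a.url == url]

    def search(self, rating=None, author=None, keywords=()):
        """Annotations matching every criterion that is given."""
        wanted = set(keywords)
        return [a for a in self.annotations
                if (rating is None or a.rating == rating)
                and (author is None or a.author == author)
                and wanted.issubset(a.keywords)]


# Hypothetical channel content and the example query from the text.
channel = Channel("Java language")
channel.publish(Annotation("http://example.org/servlet-book", "bob",
                           "Servlet book", "good", "Clear introduction",
                           ["documentation:book", "api:servlet"]))
channel.publish(Annotation("http://example.org/other", "alice",
                           "Other page", "good", "",
                           ["api:servlet"]))
matches = channel.search(rating="good", author="bob",
                         keywords=["documentation:book", "api:servlet"])
```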



² http://www.yahoo.com/

³ Frequently Asked Questions.

⁴ Graphic User Interface.



Fig. 1. Browser assistant interaction



As the amount of available annotations increases it becomes more difficult for 
users to exploit them. To reduce noise, annotation syntheses are computed. The 
synthesis algorithm gathers all available annotations about a given document 
into a recommendation. Since all users do not necessarily agree about document 
ratings, recommendations are personalized according to the users’ profiles. 

Pharos is based on a client/server architecture. For each page browsed, the 
browser assistant requests annotations from the channel server and displays 
them. If members are numerous this can dramatically increase the load on the 
server. If users are located world wide the latency of annotation fetching can 
become unacceptable. To improve performance, channels can be replicated close 
to users’ locations. This creates consistency issues which Pharos addresses with 
an optimistic protocol, guaranteeing the eventual consistency model [4,27]. 

In Section 2, we detail the main concepts, the design of Pharos services and 
the recommendation algorithms. Section 3 presents the implementation of the 
browser assistant, channel flexibility and the communication model. Section 4 de- 
scribes the scalability issues and how Pharos addresses them. Section 5 presents 
a survey of related work. Finally, in Section 6, we present our future work and 
conclusions. 



2 Services Design 

The Pharos system is made of two components: the browser assistant and the 
channel server (see Figure 1). The assistant works with any Web browser and 
provides a user interface to publish an annotation, to display annotations and
recommendations on the documents visited by the browser, to search documents
matching some criteria, and to perform administrative tasks for channels. A 
channel may be replicated at multiple locations. This section describes what 
an annotation is, how recommendations are computed, and how channels are 
managed. 



2.1 Annotations 

Annotation structure An annotation is a structured datum, published by 
someone on a channel, to describe a Web document. The user publishes annotations
explicitly. There is no “spying” on user behavior, and no trace is kept of
what has been visited. We believe strongly that it is absolutely necessary, for a
collaborative recommendation system, to respect the privacy of users, in order 
to gain user acceptance. 

An important characteristic of an annotation is that it is structured. This 
data structure depends on the channel class. A basic class, named BasicChannel,
is provided for general-purpose use. A BasicChannel annotation contains a title
which is by default the document’s title, a rating which is the user’s appreciation 
of the document, some keywords which are chosen in an extensible hierarchical 
list, and a comment which is a free textual note. 

The rating is a float value ranging from -1 to +1. However, from the user-interface
point of view, the rating is a discrete value displayed as an icon or a
text defined by the channel administrator. 

An annotation may contain multiple keywords. The keyword hierarchy may 
be viewed as a simple thesaurus. This thesaurus helps the community to share 
a common, tree-structured vocabulary when annotating documents. This avoids 
having keywords that are semantically similar but lexically different (e.g. browser 
versus navigator). The thesaurus simplifies searching by keywords and gives a 
global view of the channel topic organization. The keyword hierarchy may be
modified only by authorized users.

BasicChannel can be subclassed to satisfy specific needs. In scientific com- 
munities, users would appreciate the ability to handle bibliographical data and 
to be able to import/export data in the BibTeX [21] format. In digital libraries
there is a need to support metadata as defined in the Dublin Core [25,14].
Another important point is the life-time of annotations. It appears important 
to let the user decide, at creation time, when an annotation will become
obsolete. All these examples show the need to be able to extend BasicChannel 
to add new attributes. 
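
As a rough illustration of the structure described above, the sketch below models a BasicChannel annotation and one possible extension. It is written in Python for brevity and is not Pharos code: the class names `BasicAnnotation` and `ExpiringAnnotation`, the `url` and `author` fields and the `expires_on` attribute are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class BasicAnnotation:
    """Fields named in the text for a BasicChannel annotation (names are illustrative)."""
    url: str
    author: str
    title: str
    rating: float                      # float value in [-1, +1]
    keywords: List[str] = field(default_factory=list)
    comment: str = ""

    def __post_init__(self) -> None:
        # Reject ratings outside the range given in the text.
        if not -1.0 <= self.rating <= 1.0:
            raise ValueError("rating must lie in [-1, +1]")


@dataclass
class ExpiringAnnotation(BasicAnnotation):
    """Hypothetical extension: a lifetime chosen by the user at creation time."""
    expires_on: Optional[date] = None

    def is_obsolete(self, today: date) -> bool:
        # An annotation without an expiry date never becomes obsolete.
        return self.expires_on is not None and today > self.expires_on
```

A channel about bibliographical data could add a BibTeX field in the same way, by subclassing and declaring one more attribute.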

Functionalities While browsing the Web, annotations associated with visited 
documents are shown. This information is displayed in the browser assistant.
The list of authors that have added annotations is displayed. When selecting
a specific author, his annotation is shown. Additionally, recommendations are 
presented as if they were annotations published by pseudo users. There is one 
such pseudo user for each recommendation algorithm. The algorithms available 
in the BasicChannel class are described in Section 2.2. 

The user queries recommendations by filling in a form, indicating a list of 
criteria. This is a database querying facility, with regular expression search on 
titles, comments and URLs. Results are not displayed in the browser assistant 
but in the browser itself. This makes it straightforward for the user to navigate 
in the recommended documents (see figure 2). 
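
The querying facility can be pictured as a filter over annotation records. The following Python sketch is only an approximation under stated assumptions: annotations are plain dictionaries, and the function name `query` and its parameters (`author`, `min_rating`, `keywords`, `pattern`) are invented for the example. The regular expression is searched in titles, comments and URLs, as described above.

```python
import re
from typing import Dict, Iterable, List, Optional


def query(annotations: Iterable[Dict], *, author: Optional[str] = None,
          min_rating: float = -1.0, keywords: Iterable[str] = (),
          pattern: Optional[str] = None) -> List[Dict]:
    """Return the annotations matching every given criterion."""
    wanted = set(keywords)
    regex = re.compile(pattern, re.IGNORECASE) if pattern else None
    hits = []
    for a in annotations:
        if author is not None and a["author"] != author:
            continue
        if a["rating"] < min_rating:
            continue
        if not wanted.issubset(a["keywords"]):
            continue
        # The regular expression may match the title, the comment or the URL.
        if regex and not any(regex.search(a[f]) for f in ("title", "comment", "url")):
            continue
        hits.append(a)
    return hits
```

For instance, `query(db, author="bob", min_rating=0.5, keywords=["documentation:book", "api:servlet"])` would select the documents that "bob" rated well and filed under both keywords.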



Fig. 2. The assistant is on the left and the browser is displaying results of a query
concerning XML. (Assistant panels shown: channel selection, last visited URLs.)



Another way to consult annotations is very similar to bookmarks or favorites 
provided in Netscape or Internet Explorer. The annotations are visible in a tree 
having the same structure as the keywords hierarchy. The leaves of this tree 
are the recommendations themselves. This user interface has been provided to 
give regular users an interface similar to the one for bookmarks/favorites in the 
browser.

Web access also exists, providing another way to consult annotations and
recommendations through two levels of queries: advanced and simple. The first 
one is similar to the assistant querying. The second one provides access through 
a hypertext paradigm. 



Document identification An annotation is associated with a document. An- 
notations are stored in a channel server. To identify the associated document, 
each annotation contains an URL. This raises problems because an URL is a 
locator to access a document, not a name to identify a document: a document 
can be updated, moved, mirrored, be temporarily unreachable, or definitively 
destroyed. Moreover, there is probably a need for annotating not only docu- 
ments but also objects such as movies or CDs. A simple solution is to annotate a 
vendor page, but such a page may not exist or there may be more than one such 
page. A correct solution would be to rely on a naming service like PURL [13], 
possibly based on bar codes [17]. 



2.2 Collaborative Recommendations 

A recommendation is the synthesis of all the annotations on a single document. 
This section describes how recommendations are computed in the BasicChan- 
nel. It is important to note that computation of recommendations is made easier 
because annotations are structured data. 

The important feature is that everybody gets a personalized recommenda- 
tion. The goal is to avoid, or at least to reduce, two problems: average effect 
and pollution. If the recommendations were the same for everybody, they would
correspond to an average opinion, and minority annotations would be diluted in the
mass. By contrast, when people weigh multiple opinions themselves, they tend
to give more weight to some people, because they are experts or because
they share the same taste, depending on the subject. This is what Pharos
automates. The second problem is the sensitivity to advertisement pollution.
For instance, a vendor could flood a channel with annotations from fake users
praising his product. A solution is to ignore these annotations by giving a null
weight to all these fake users. This may be done explicitly by other users. It is
done automatically by Pharos with the correlation algorithm (see next sections).
These solutions minimize the need for a moderator.



Computing recommendation rating The rating is a subjective value and
plays an important role when computing recommendations. The function estimating
the rating of the URL u for the member m, predictedRating(m, u), is
defined by the following formula. Let M(u) be the set of members having
added an annotation for the URL u, weight(m, n) the weight or confidence
placed by member m on member n, in the range [-1, +1], and rating(m, u) the rating
of the URL u by the member m, in the range [-1, +1]:

\[
\mathit{predictedRating}(m, u) = \frac{\sum_{n \in M(u)} \mathit{weight}(m, n) \times \mathit{rating}(n, u)}{\sum_{n \in M(u)} \lvert \mathit{weight}(m, n) \rvert} \qquad (1)
\]

This function is a weighted average of ratings. If the denominator is zero the
function returns a customizable default value. If a member m places no confidence
in member n, that is if weight(m, n) = 0, the rating of member n is ignored
when predicting the rating for member m. Note that the confidence placed by 
someone on somebody else depends on the subject. This implies that the scope 
of a channel must not be too large. 
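
The weighted average of equation (1) can be sketched in a few lines of Python. This is an illustration rather than the Pharos implementation: the function name, the dictionary-based inputs and the `initial_weight` parameter (standing in for the default weight given to members who have not been weighted explicitly) are assumptions.

```python
from typing import Dict


def predicted_rating(weights: Dict[str, float], ratings: Dict[str, float],
                     default: float = 0.0, initial_weight: float = 1.0) -> float:
    """Weighted average of ratings, following equation (1).

    weights[n] is weight(m, n) for the member m asking for the prediction;
    ratings[n] is rating(n, u) for every member n in M(u).
    """
    numerator = sum(weights.get(n, initial_weight) * r for n, r in ratings.items())
    denominator = sum(abs(weights.get(n, initial_weight)) for n in ratings)
    # A zero denominator (no annotation, or only zero weights) yields the default.
    return numerator / denominator if denominator else default
```

With weights `{"alice": 1.0, "bob": -0.5, "carol": 0.0}` and ratings `{"alice": 1.0, "bob": 1.0, "carol": -1.0}`, carol's rating is ignored and bob's counts against the document, so the prediction is (1.0 - 0.5) / 1.5.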

The BasicChannel provides two ways of defining the weight: explicitly, or 
automatically by correlation. In the first case, the user edits the weight he uses 
for each of the other members. By default all users have the same initial posi- 
tive weight. This allows each user to distinguish some people because they agree 
(positive weight) or disagree (negative weight) with them, or don’t care about 
them (null weight). 



Computing correlation between users If there are too many users in a channel
it would be tedious to assign a weight to each of them, so the weight would
remain the default positive value for most of them. In this case the predicted
rating would be equivalent to the average. To solve this problem, the BasicChannel
offers a second way, which automates the weight assignment. The correlation
algorithm determines users having similar profiles by comparing the annotations
they have added in the past. The weight weight(m, n) given by member m to
member n is equal to correlation(m, n), the correlation between the two users.
This function takes values in the range [-1, +1]: from -1 for systematic contradiction,
up to +1 for systematic agreement. Let U(m, n) be the set of URLs
annotated by both members m and n.



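\[
\mathit{correlation}(m, n) = \frac{\sum_{u \in U(m,n)} \mathit{rating}(m, u) \times \mathit{rating}(n, u)}{\sqrt{\sum_{u \in U(m,n)} \mathit{rating}(m, u)^2} \times \sqrt{\sum_{u \in U(m,n)} \mathit{rating}(n, u)^2}} \qquad (2)
\]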



One property of this function is that ratings equal to zero are not taken into account.
If the denominator is zero the function returns a customizable default value. This
default value must not be equal to zero, otherwise new users would have the
same predicted rating for all documents. On the other hand, this default value
must be very close to zero to give a small weight to fake users praising only
their own pages.
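
A minimal Python sketch of the correlation between two members follows. It assumes that the function is the normalized product of the ratings over the URLs annotated by both members (the sum of products divided by the product of the square roots of the sums of squares), which is consistent with the properties stated above; the function name and the default value 0.01 are illustrative assumptions.

```python
from math import sqrt
from typing import Dict


def correlation(ratings_m: Dict[str, float], ratings_n: Dict[str, float],
                default: float = 0.01) -> float:
    """Correlation between two members over the URLs both have annotated.

    ratings_m[u] is rating(m, u); ratings_n[u] is rating(n, u).
    """
    common = ratings_m.keys() & ratings_n.keys()          # U(m, n)
    numerator = sum(ratings_m[u] * ratings_n[u] for u in common)
    denominator = (sqrt(sum(ratings_m[u] ** 2 for u in common))
                   * sqrt(sum(ratings_n[u] ** 2 for u in common)))
    # No common URL, or only zero ratings: fall back to a small non-zero default.
    return numerator / denominator if denominator else default
```

Two members with identical ratings obtain a correlation of +1, two members with exactly opposite ratings obtain -1, and a newcomer with no URL in common with anyone receives the small default weight.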



Other very similar algorithms have been used in recommendation systems [22,24].
An original point of Pharos is to empower the user by allowing the algorithm to
be explicitly weighted. From a user interface perspective, everything works as if
there were pseudo-users named “weighted” and “correlated”. The pseudo-user
“correlated” computes a prediction defining weight(m, n) as being correlation(m, n).
The pseudo-user “weighted” computes a prediction using weight(m, n) explicitly
given by the user. The pseudo-user “correlated” may be given a weight indicating 
the confidence of member m in the “correlated” algorithm. This way, the user 
has a complete and intuitive control over the way the “weighted” prediction is 
computed. He rates the algorithm by giving a weight, as for any other user. This 
mechanism provides a natural way to integrate other algorithms: they would 
simply be associated with other pseudo-users. 



Computing recommendation A recommendation is the synthesis of all the 
annotations on a single document and contains the same attributes. We have 
discussed how rating is used to compute correlation between users and how it 
is used to compute a predicted rating. We now describe how other attributes of 
the annotations are treated. The rating still plays an important role. 

The title and the comment of the recommendation are synthesized using a 
“best guess” algorithm. When computing a recommendation for member m on 
URL u, a title is chosen among the annotations of members n so as to maximize the
function:

\[
\mathit{pertinence}(m, u, n) = \mathit{weight}(m, n) \times \mathit{rating}(n, u) \qquad (3)
\]

The rationale for this algorithm is to prefer title and comment from a user 
both having been assigned a good confidence and having appreciated the recom- 
mended document. 

For efficiency, the keywords are synthesized another way. The set of keywords
associated with a recommendation for a URL is the union of all sets of keywords
in all annotations on the given URL.
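
The synthesis of the non-rating attributes can be sketched as follows. This Python example is an illustration under assumptions: annotations are dictionaries, and the function name `synthesize` and the `initial_weight` default are invented here. The title and comment come from the annotation maximizing the pertinence of equation (3), and the keywords are the union of all keyword sets.

```python
from typing import Dict, List, Optional


def synthesize(annotations: List[Dict], weights: Dict[str, float],
               initial_weight: float = 1.0) -> Optional[Dict]:
    """Build the title, comment and keywords of a recommendation for one URL.

    weights[n] is weight(m, n) for the member m receiving the recommendation.
    """
    if not annotations:
        return None

    def pertinence(a: Dict) -> float:
        # Equation (3): weight(m, n) x rating(n, u).
        return weights.get(a["author"], initial_weight) * a["rating"]

    best = max(annotations, key=pertinence)     # "best guess" for title and comment
    keywords = set()
    for a in annotations:
        keywords |= set(a["keywords"])          # union of all keyword sets
    return {"title": best["title"], "comment": best["comment"],
            "keywords": sorted(keywords)}
```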

2.3 Annotation Channels 

A channel represents a community of users sharing the same interest (e.g. the 
Java language, Web technologies or teaching). The number of users registered 
in a channel varies according to the community. A channel can be composed of a
few users annotating documents on a very specialized topic or, at the opposite extreme,
it can concern a more general topic intended for mass consumption on a
worldwide scale. The channel stores members’ annotations in a database. It can
extract annotations associated with a specific document, annotated documents 
matching some annotation criteria, and compute personalized recommendations. 

Channels are autonomous entities. That is, instead of having one huge data- 
base for all annotations of all topics, there are as many channels as there are 
topics. Pharos does not require the channels to all be located in the same place.
They can be distributed across the network. This design choice increases system
flexibility and performance for several reasons: 

1. Channel access control policy can be customized according to each community.
Some channels can be public, not filtering member registration, allowing
anyone to add annotations and providing Web access to increase their accessibility.
Others can contain private data and be restricted to authorized
members.

2. Channel management is flexible. Creating a channel on a new topic is very 
simple and only requires a machine to host the channel. Tricky operations 
such as halting a channel for maintenance, exchanging updates for replicated 
channels, and merging or splitting channels, do not have any impact on other 
channels. 

3. Network traffic and server load are reduced. Channels can be located close to
their members and are only accessed by people interested in their topic. 

4. Members can replicate on their machine some of the channels they have 
subscribed to. 

5. Relevancy and performance of synthesis algorithms are improved. These al- 
gorithms produce personalized recommendations according to users’ profiles. 
Channel separation increases topic locality for each profile. For instance, Al- 
ice and Bob can have related profiles in the Java channel but opposite ones 
in the Music channel. 

6. The annotation structure can be specialized for each channel to better fit 
with the community’s requirements. For instance, a channel about distributed 
computing would have a BibTeX [21] field to annotate research papers whereas
one about financial news would have a field specifying the duration of validity
of the annotation.

Channels are located with an URL. Member creation and authentication 
is channel-dependent. Some channels can allow unrestricted subscription with 
identification based on the user name or email, whereas others require the chan- 
nel’s administrator to register a new member and use more secure authentication 
mechanisms. Once the user is logged into the channel, he can perform the opera- 
tions for which he is authorized: accessing annotations, publishing an annotation, 
editing the keyword hierarchy, and administration. 

Users can quickly create a new channel by instantiating and customizing a 
BasicChannel (see section 2.1). BasicChannel is a generic channel which has 
an all-purpose annotation structure (title, rating, comments, and a list of key- 
words). It permits searching, Web access publishing, and provides a thesaurus 
manager. At channel creation time the thesaurus is empty. Authorized members
can add, delete, move or rename terms when needed. Finally, if the basic
channel does not fit exactly with the community’s requirements, Pharos provides
an extensible framework which is able to host any kind of channel. In this case,
the community must develop a new channel class or extend an existing one to 
match their specific needs.

3 Implementation

Pharos has been developed in Java [1] which provides useful features for building
dynamic and portable architectures. This section details the implementation
of the channels and of the browser assistant. 

3.1 Channel Components 

Channels are mainly composed of two parts: the backend and the frontend (see 
Figure 3). The backend is the server part of a channel. It manages member sub- 
scription, the annotation database and additional data such as the thesaurus of 
keywords. The backend processes queries from frontends. It uses a batch process
to precompute data used later to produce recommendations. If the channel is
replicated, each backend exchanges updates with its peers to maintain a consistent
replica at each backend. If the channel has Web access, the backend provides
an HTML interface to query or publish annotations.

The frontend is the client part of a channel. It contains the channel GUI 
hosted in the browser assistant. It allows the user to add, display and query 
annotations, to edit the thesaurus and to customize his profile. 

Frontend/backend communication relies on RMI [28]. RMI (Remote Method
Invocation) is a Java API based on the RPC paradigm extended for the Java 
object model. Each backend exports a remote object which is invoked by fron- 
tends. However, some updates are propagated from the backend to frontends 
thanks to a lazy notification mechanism detailed in Section 4.1. 

To share certain resources (e.g. network port, Web access,
JVM, code), backends hosted on the same machine can be aggregated in a
PharosServer. It has its own RMI access which is used by the browser assistant 
to list available channels on a machine. This helps users to discover and choose 
the channels they want to subscribe to. Finally, the PharosServer has a Web 
access which gives links to backends that have registered a Web access. 

3.2 The Browser Assistant 

Pharos has been designed to annotate Web documents. Frontends must be aware 
of the URL the user is browsing. To this end several techniques exist, such as 
HTTP proxy [16], browser parasite [17], customized browser, applets or browser 
plug-ins. The proxy solution has been used in Pharos because it works with any 
browser and provides the assistant with a high degree of control to observe the 
traffic and to enrich the content of returned pages. The browser assistant re- 
lies on Pluxy [7], an extensible HTTP proxy. Pluxy can dynamically aggregate 
a set of proxy components, the pluxins. Each pluxin can observe requests and 

JVM: Java Virtual Machine.

Fig. 3. Browser assistant interaction



responses, modify them, or provide a response. 

The browser assistant is made of a pluxin which (i) observes request URLs 
in order to fetch associated annotations and (ii) extracts the HTML title of the 
page (if any) from the responses to fill the title field in the annotation editor. In 
an upcoming version, it will also be used to enrich HTML hyperlinks with icons 
pointing out the relevancy of targeted documents. 

When the user queries the annotation database, the frontend builds an
HTML page containing the results and displays it in the browser (see Figure 
2). However, the proxy model does not allow the proxy to send requests to the 
client. The frontend therefore uses a mechanism to tell the browser to fetch 
a special URL. When the proxy receives this special URL, it calls the targeted 
frontend which then returns the HTML result page. This notification mechanism 
is browser- and OS-dependent. Under Unix/X11, Netscape provides a special option
to open a URL. Under Windows, Netscape and Internet Explorer provide
a DDE communication handler.



3.3 Channel Flexibility 

Pharos handles a variety of channels thanks to a composition architecture. This 
architecture relies on the JPlug framework [6] which is intended for building 
modular and extensible applications in Java. JPlug allows an application to 
confine some functionality within components and gather these components

Netscape remote option: netscape -remote openURL(url)

DDE: Dynamic Data Exchange.

into a container. A JPlug component is a set of resources such as code (Java
classes), configuration files and icons for the GUI. JPlug allows any component 
to be dynamically installed, loaded, started, stopped and unloaded. Installing 
a component means fetching it and installing it on a local disk to be loaded. 
Loading a component means creating an instance of this component and initial- 
izing it. Start and stop events control its activity. Finally, unload removes the 
component from its container. Combined with the load functionality, this helps 
to debug and to tune a component without restarting the whole application. 
Furthermore, JPlug also provides facilities to update and reinstall new versions
of a component automatically.

When loaded, each component receives a sandbox, i.e. a node in the file 
system, where it can create new files. A SecurityManager [9] ensures that a 
component does not access files outside of its sandbox. A component can rely 
on other components. Such a component expresses its dependencies in its description
file. JPlug ensures that all the required components will be loaded in
the right order. JPlug supports component inheritance. That is, a component
receives the resources (including the code) of its parent component but can over- 
ride some of them. 
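
Loading components in dependency order amounts to a topological sort of the dependency graph. The Python sketch below illustrates the idea with the standard `graphlib` module (Python 3.9 or later); it is not JPlug code, and the function name `load_order` and the component names used in the usage note are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def load_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every component follows the ones it requires.

    dependencies maps a component name to the set of components it depends on.
    Raises graphlib.CycleError if the dependency graph contains a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

For example, `load_order({"JavaChannel": {"BasicChannel"}, "BasicChannel": {"Core"}, "Core": set()})` places `Core` before `BasicChannel` and `BasicChannel` before `JavaChannel`, whereas two components requiring each other cause a `CycleError`.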

Channels’ frontends and backends are JPlug components which are respec- 
tively plugged into the browser assistant and the PharosServer. When a user 
subscribes to a new channel, the corresponding frontend is downloaded, installed 
and loaded. The BasicChannel can be specialized with the component inheri- 
tance mechanism. The specialization consists only of properties and file settings. 
So, combined with the component loading mechanism, users can quickly and dy- 
namically create new channels by inheriting the BasicChannel component. The 
starting, stopping and unloading facilities allow PharosServer’s administrators 
to perform tricky operations on a channel without stopping all the other ones. 



4 Scalability 

Pharos distributes communities of users into channels and then pushes scalabil- 
ity issues onto the channels. A channel is said to be scalable if it can handle 
the addition of users and annotations without suffering a noticeable loss of per- 
formance [20]. This section describes two techniques used by the BasicChannel 
to reduce server load and to increase their availability: (i) frequently used data 
are cached on the frontend side, (ii) backends can be replicated close to their
members. 



JPlug loads components according to a topological sort; the dependency graph must
therefore be acyclic.

4.1 Data Caching

To reduce network exchanges, frontends cache some frequently used data such as 
the users list and the thesaurus of keywords. These data are infrequently updated 
but when a backend receives an update it must propagate it to all of its frontends. 
However, since RMI is a point-to-point communication protocol, forwarding an 
update to all the clients requires enumerating all the frontends,
contacting each of them and sending it the updates. Such an approach would be very
inefficient and would reduce the scalability of a backend. We therefore use a lazy
notification mechanism. Each operation is assigned a timestamp ts. The frontend
piggybacks the last timestamp ts it received from the backend on each request. 
When a backend is invoked, in addition to the request processing, it extracts 
updates with a greater timestamp than ts and piggybacks them on the returned 
values of the request. 
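
The lazy notification mechanism can be illustrated with a small in-process simulation. The Python sketch below replaces RMI with direct method calls and uses a simple counter as timestamp; the class and method names (`Backend`, `Frontend`, `publish`, `handle`, `call`) are assumptions made for the example, not the Pharos API.

```python
from typing import Any, List, Tuple


class Backend:
    """Keeps a timestamped log of updates to the cached data."""

    def __init__(self) -> None:
        self._clock = 0
        self._log: List[Tuple[int, Any]] = []   # (ts, update), ts increasing

    def publish(self, update: Any) -> int:
        self._clock += 1
        self._log.append((self._clock, update))
        return self._clock

    def handle(self, last_ts: int, request: str) -> Tuple[str, List[Any], int]:
        result = f"handled {request}"            # stands in for the real processing
        # Piggyback every update the frontend has not seen yet.
        pending = [u for ts, u in self._log if ts > last_ts]
        return result, pending, self._clock


class Frontend:
    """Caches updates and sends its last known timestamp with each request."""

    def __init__(self, backend: Backend) -> None:
        self.backend = backend
        self.last_ts = 0
        self.cache: List[Any] = []

    def call(self, request: str) -> str:
        result, pending, ts = self.backend.handle(self.last_ts, request)
        self.cache.extend(pending)
        self.last_ts = ts
        return result
```

The backend never contacts a frontend on its own initiative: a frontend that issues no request simply keeps a stale cache until its next call.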

4.2 Channel Replication 

Replication is another way to scale distributed systems [20]. It increases
performance by reducing network distance and improves reliability by multiply- 
ing places where data are available. However, the main difficulty is to keep all 
the replicas mutually consistent. That is, when a write operation is done on a 
replica, the system must ensure that it does not corrupt the global consistency 
model of the system. 



Consistency models and protocols Consistency models fall into one of two 
categories: strong or weak [10]. The former ensures there are no inconsistencies 
in the logical view of the group. The latter allows replicas to contain invalid data. 
Weak consistency requires fewer agreement message exchanges and fits better 
with the network constraints we address. It can be achieved with two kinds of 
protocol: pessimistic or optimistic. The first prevents any write operations from 
producing conflicts when replicas exchange their updates. Indeed, some data 
are dependent on the order in which the operations are processed. The second
allows arbitrary write operations on any replicas and delays potential conflict 
resolution until later. 



The replication protocol The complete details of the replication protocol are 
beyond the scope of this article. In this section we give the main principles
which have been chosen.

The replication protocol of BasicChannel has been inspired by Bayou [27] 
which guarantees eventual consistency [4]. It is a weak consistency model based

In the literature [3] executions of such operations are said to be non-serializable.
For instance, non-commutative operations such as addition and multiplication are
non-serializable (e.g. x ← x + 1; x ← x × 2).

on an optimistic protocol. It allows write operations without any agreement with
the rest of the group. As a result, replicas can contain out-of-date data. Replicas 
synchronize their databases by exchanging their updates from peer to peer. Updates
are propagated epidemically [8], which guarantees that all the replicas will
finally reach a consistent state. In BasicChannel, the partner selection pattern
for synchronization is left to the channel’s administrator. The protocol does not 
prevent conflicts due to parallel write operations. We propose a new model to 
detect and handle conflicts. Conflict detection is semi-automatic. We distinguish 
conflicts concerning data structure integrity and those concerning the semantic 
integrity of the data. The former is handled by the collections themselves which 
ensure their data structure is not corrupted (e.g. an operation on a tree must 
not transform it into a graph). The latter is handled by the application which 
uses these collections (e.g. a thesaurus mapped onto a tree ensures that no term
appears twice at the same node level). When a conflict is detected, we keep the
conflicting operations which are considered as propositions. The administrator 
is responsible for accepting one of those operations or rejecting all. 
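
As a simplified illustration of the semantic check mentioned above, the Python sketch below groups term additions made on different replicas and reports those that would create the same term twice under the same thesaurus node. It covers only this one kind of conflict, and the operation format `(replica, parent, term)` and the function name are assumptions, not the BasicChannel protocol.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Operation = Tuple[str, str, str]   # (replica, parent node, new term)


def detect_conflicts(operations: List[Operation]) -> Tuple[List[Operation], List[List[Operation]]]:
    """Split term additions into accepted operations and conflicting proposals."""
    by_slot: Dict[Tuple[str, str], List[Operation]] = defaultdict(list)
    for op in operations:
        _replica, parent, term = op
        by_slot[(parent, term)].append(op)
    accepted = [ops[0] for ops in by_slot.values() if len(ops) == 1]
    # Conflicting operations are kept as proposals for the administrator.
    proposals = [ops for ops in by_slot.values() if len(ops) > 1]
    return accepted, proposals
```

The administrator would then accept one operation from each proposal group or reject them all.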



5 Related Work

5.1 Collaborative Indexing 

Several works address the indexing problem through collaborative systems. Marais
and Bharat [17] use both content and collaborative indexing. Their desktop assistant,
Vistabar, continuously indexes the full text of all viewed pages. In
addition, they allow people to add comments about Web pages. A comment has a
unique structure composed of the page’s URL, the page’s title, a set of categories
chosen in a hierarchy, a short subject, a free text area and an icon indicating
the nature of the comment. They do not support comment structures specific to
a given community. They have neither a rating model nor synthesizing mechanisms
to reduce noise. The comments feed a centralized database which represents the
CommonKnowledge of a community. As a result, they do not address scalability
issues as described in Section 4.

The Open Directory Project’s goal is to produce a comprehensive directory
of the Web, by relying on volunteer editors. The indexing is mapped onto a
tree structure. Each node of the tree represents a category (e.g. Sports:Tennis,
Computers:Hardware, and so on). The user interface is very similar to Yahoo!; 
people browse in the categories and have a full-text search from each category. 
Volunteers are responsible for editing entries in categories. This project uses del- 
egation rather than collaboration. There is only one editor per category. Anyone 

In fact, this state would be reached if and only if all replicas stopped exchanging 
updates. That is, if no more write operations were done. 

They have developed a parasite browser which works with InternetExplorer. 

They use the AltaVista NI2 library to build a full-text indexer.
http://www.dmoz.org/

can submit a site for a category but it will be validated by the editor of that
category. Nevertheless, adding a site matching several categories requires sub- 
mitting it to all editors of these categories. In fact, the tree does not help to 
categorize entries but rather to refine the search. Each leaf of the tree is composed
of a link to the site and of a short description. Very few of them
have an icon indicating that it is a good site. Furthermore, users must trust the
editors and they cannot give their opinion of the indexed sites. So, it helps to 
find information but not to evaluate it. 



5.2 Recommendation Systems 

Only some representative works are presented here. For an overview of systems 
see [23]. For a discussion on the application of recommendation in digital li- 
braries see [5]. 

GroupLens [22] is a representative example of a collaborative filtering tool. 
It was primarily dedicated to net-news. Each user puts a rating on messages. 
The system computes correlations between users according to their ratings. The 
main drawback of the system lies in the network architecture. It does not address 
the scalability issues and the centralized server is often overloaded. 

Commentor [18] is one of the first annotation services dedicated to the Web.
It is part of the Stanford Integrated Digital Library Project. It is intended to
share textual unstructured data which are inserted in Web documents. It does not
support rating or aggregation of multiple opinions. It has been developed on
a modified version of a browser. The server handles access rights with the notion
of group and annotation set.

Firefly and NetPerceptions, the commercial version of GroupLens, illustrate
the main target of today’s recommendation products. They provide personalization
for visitors of commercial Web sites. For instance Amazon recommends
books to its customers. This is based both on explicit recommendation and on
previous purchases. These kinds of services are limited to visitors and customers 
of a Web site. 

Fab [2] is an example of a hybrid system. It integrates both collaborative 
and content-based recommendations. The latter tries to recommend documents 
similar to those a given user has liked in the past. It tries to avoid the limitations 
of both categories (number of similar users for the former, noise for the latter). 
The annotation contains only a rating, as for GroupLens, and the data are 
centralized. 

Firefly: http://www.firefly.net/studio/applications/ 
NetPerceptions: http://www.netperceptions.com/ 
Amazon: http://www.amazon.com/ 
6 Future Work 

Experimentation The major functionalities of Pharos have been implemented. 
The current work concerns the URL issues and the scalability issues. A first 
prototype has been used internally for some months and the fastest version is 
released: the browser assistant is freely available. 

Currently, experimentation is being led by CNET with the help of 
ergonomists. Pharos will be used by teachers to recommend documents that may 
serve as teaching materials. We are supported by CNDP, which provides us with 
the Educasource database to help this experimentation. 

Inside INRIA and Bull, some technical channels are used to closely follow 
technical and scientific developments in areas such as Java, Web technologies, 
distributed computing, or programming. Some other channels, such as Art, are 
intended for a larger audience. We foresee the use of a meta channel to 
recommend channels themselves. 



Recommendation This paper has presented two recommendation algorithms. 
Other algorithms are envisaged. For instance, instead of using correlation be- 
tween users on their annotations, it is also possible to use profile correlation: 
if member A places a high confidence in member B, who in turn places a high 
confidence in member C, then it could be deduced that member A places a rea- 
sonable confidence in member C. 
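
The profile-correlation idea can be sketched as follows. This is only an 
illustration: the text does not say how confidences are combined, so the sketch 
assumes confidences in [0, 1], multiplies them along a chain of members, and 
keeps the best chain. The function name derived_confidence, the max_hops bound 
and the sample values are hypothetical. 

```python
def derived_confidence(direct, source, target, max_hops=3):
    """Best product of direct confidences along any chain source -> ... -> target.

    `direct` maps (member, member) pairs to a confidence in [0, 1].
    Assumption: chained confidence is the product of the direct confidences.
    """
    best = direct.get((source, target), 0.0)
    frontier = {source: 1.0}  # members reached so far, with the chain confidence
    for _ in range(max_hops):
        next_frontier = {}
        for (a, b), conf in direct.items():
            if a in frontier:
                chained = frontier[a] * conf
                if chained > next_frontier.get(b, 0.0):
                    next_frontier[b] = chained
        best = max(best, next_frontier.get(target, 0.0))
        frontier = next_frontier
    return best


# Hypothetical values: A trusts B, and B trusts C.
direct = {("A", "B"): 0.9, ("B", "C"): 0.8}
print(derived_confidence(direct, "A", "C"))
```

With a product rule, the derived confidence of A in C is lower than either 
direct confidence, which matches the "reasonable confidence" wording above. 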

We intend to validate the rating prediction algorithms by checking their 
output against the real ratings published by the users. We are aware that this 
approach has two limitations. Firstly, the best function for computing 
predictions depends on the function chosen to evaluate accuracy. Secondly, 
members won't annotate most documents and, worse, users might annotate 
documents only when they disagree with the prediction. If they agree with the 
prediction, they are happy with it and don't feel it necessary to give their 
opinion. 
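
A minimal sketch of such a validation is given below. It assumes that 
predictions and published ratings are stored per document and uses the mean 
absolute error as the accuracy function; this is only one possible choice, 
which is exactly the first limitation mentioned above. The function name and 
the sample ratings are hypothetical. 

```python
def mean_absolute_error(predicted, published):
    """Average |prediction - rating| over documents that have both values."""
    common = [doc for doc in predicted if doc in published]
    if not common:
        return None  # nothing to compare: no predicted document was annotated
    return sum(abs(predicted[d] - published[d]) for d in common) / len(common)


# Hypothetical data: doc3 was predicted but never annotated by the user.
predicted = {"doc1": 4.0, "doc2": 2.5, "doc3": 5.0}
published = {"doc1": 5.0, "doc2": 2.5}
print(mean_absolute_error(predicted, published))
```

Documents without a published rating are simply ignored, so the measure says 
nothing about them; this reflects the second limitation. 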



Integration with other applications Pharos is not intended to make existing 
indexing services obsolete. Pharos is mostly a complementary approach. 
Moreover, Pharos may be integrated with an indexing service to provide advanced 
functionalities. Firstly, indexing recommended documents would allow the 
retrieval of documents according to both kinds of criteria: the subjective 
meta-data of Pharos and the content of the document. Secondly, some indexing 
tools are able to propose classifications. Pharos could submit these proposed 
classifications to the user for validation. 

CNET - Centre de Recherche et Developpement de France Telecom. 

CNDP - Centre National de Documentation Pedagogique. 
It would be interesting to integrate Pharos with a Web cache server. Pharos 
could ask the cache to tell the user if a document is already present in the 
cache. In the other direction, Pharos could give the cache a hint of the 
probability that a document will be retrieved in the future. Basically, the 
higher a document is rated, the more it is recommended and the more likely it 
is to be retrieved. 

Finally, we expect to support the RDF data format, based on XML, to 
facilitate export and import to and from other applications. RDF has been 
proposed as a generalization of the format designed for PICS and is a good 
candidate to become a widely used standard. 

7 Conclusion 

Pharos brings a new kind of service to Web users. Some channels are publicly 
available at http://webtools.dyade.fr/pharos/. Compared to indexing tools, 
Pharos provides a collaborative forum for communities to exchange subjective 
information. Compared to newsgroups, it copes with the heterogeneity of a large 
number of members by automatically synthesizing a personalized recommendation 
for each member. Compared to other recommendation services, it provides 
more flexibility: an annotation is structured data, not just a rating or a 
text; users can attribute a confidence to other users and to the algorithms 
aggregating multiple annotations; and the architecture supports scalability 
thanks to replication. Exchanging recommendations is a widely used practice in 
professional and daily life. We believe that services such as Pharos will be 
widely used in many areas. In the digital library domain, these recommendation 
tools will be a new way to manage knowledge and to involve both end users and 
librarians. In other words, we believe that human beings are the intelligent 
agents of the Internet. 
Abstract. In this paper we present a new approach for building RDF 
schemas by integrating existing ontologies and structured vocabularies 
(thesauri). We will present a simple mechanism based on the specification 
of inclusion relationships between thesaurus terms and ontology concepts 
and show how these relationships can be exploited to create application- 
specific RDF schemas incorporating the structural views of ontologies 
and deep classification schemes provided by thesauri. 



1 Introduction 

With the emergence of the World Wide Web, Internet and Intranet technologies, 
a large number of information sources from a variety of different application do- 
mains have become available on line. In such open and evolving environments, 
discovering, accessing and integrating information are difficult and complex tasks 
due to the existence of semantic heterogeneities [35], resulting from the differ- 
ent terminologies and conceptualizations employed by the various information 
providers and consumers. 

A partial solution to the semantic heterogeneity problem is the exchange of 
domain-specific metadata [22,41,35] between interconnected systems, describing 
the semantics of the underlying information. More specifically, these seman- 
tics are expressed by metadata schemas, defined by specific resource description 
communities. A metadata schema is comprised of (1) a vocabulary, i.e. a set of 
element names to be used for the description of information in a domain (e.g. 
the creator, title elements of the Dublin Core [12] metadata element set), and (2) 
a set of semantic relationships to structure this information. One of the several 
roles of metadata schemas in open and evolving environments such as the Web, 
is to support a sharable, structural view of information with rich semantics to 
be communicated between users and applications. 

Metadata specification languages, such as the Resource Description Frame- 
work (RDF) [34,6], support standard mechanisms for the representation of meta- 
data schemas as well as source specific metadata (source descriptions). RDF is 
an ongoing standardization effort of the World-Wide Web Consortium (W3C) 
for the creation of metadata describing Web resources. Although it enables the 
description and exchange of metadata schemas, it does not provide a mechanism 
to facilitate their construction, which is a difficult and time-consuming 
task, especially in environments that comprise a large number of information 
sources. Moreover, it offers no mechanism to decide whether a particular 
metadata schema meets the needs of an application or domain. To that end, we 
need to consider semantic components and structural views that describe the 
organization of the underlying information. 

In this paper, we present a modular approach for the creation of RDF schemas 
based on the integration of existing ontologies and thesaurus hierarchies defined 
according to the ISO 2788 [20] standard for monolingual thesauri. Ontologies and 
thesauri can be considered as orthogonal ways for describing information. The 
former provide structural, sharable views of information, with usually shallow 
semantics, captured in metadata schemas. They are declarative specifications of 
the concepts and roles in a domain of discourse. Thesauri are structured vocab- 
ularies, with rich semantics but little or no structure. For example, although the 
Art & Architecture Thesaurus, one of the largest thesauri in the field of western 
art terminology, includes extended taxonomies of cultural artifacts and styles, 
there is no explicit relationship denoting the fact that artifacts have a style. In 
the context of our approach, ontologies are perceived to have a dual role: provide 
a generic view of information and a structural interface over thesauri. 

We follow a three-step approach to the construction of RDF schemas. In a 
first step, we specify for each thesaurus term, a set of ontology concepts, the 
former being considered as sub-concepts of the latter. The result of this step is 
a connection relation between terms and concepts with inclusion semantics. In 
a second, intermediate step, we extract automatically for each concept a concept 
thesaurus. This thesaurus contains only the terms connected to this concept by 
the connection relation, along with broader-generic relationships derived from 
the initial thesaurus. In the final step we integrate these thesauri with the on- 
tology to produce an RDF schema consisting of (1) a structural view provided 
by the ontology, (2) connection relations between concepts and terms, and (3) 
thesaurus hierarchies. With this intermediate step it is possible to construct the 
resulting schema incrementally by extracting on demand concept thesauri that 
correspond to different ontology concepts. 

Our contribution is two-fold. First, by using existing components, we mini- 
mize the time and effort to specify appropriate notions that describe the content 
and structure of a domain in the form of an RDF schema. Second, the resulting 
RDF schema is not bound to a specific implementation and can be used by any 
application which is based on the RDF standard. 

To illustrate our approach, we take examples from the cultural application 
domain. Thesaurus examples are taken from the Art & Architecture Thesaurus 
(AAT). The AAT is one of the Getty Information Institute's 
(http://www.gii.getty.edu/) ongoing projects and is known as one of the 
largest thesauri in the area of western art historical terminology. Ontology 
examples are inspired by the ICOM/CIDOC Reference Model. The ICOM/CIDOC 
Reference Model [19] is the result of one of the most significant efforts 
towards a formal representation of the basic notions of the cultural 
application domain. 

This paper is organized as follows. Related work is presented in Section 2. 
Section 3 gives a short presentation of RDF. In Sections 4 and 5, we describe the 
notion of ontology and thesaurus respectively. In Section 6, we present our ap- 
proach to the automatic construction of RDF schemas by integrating ontologies 
and thesaurus hierarchies. Conclusions and future work are given in Section 7. 



2 Related Work 

Over the past years a great amount of effort has been invested in the develop- 
ment of metadata vocabularies for the exchange of information across different 
applications and domains [12,25,40,10]. Dublin Core [12] contributes to seman- 
tic interoperability by promoting a common set of elements which can be used 
to describe in a consistent manner information concerning the contents of elec- 
tronic documents, such as their title, creator, or subject. USMARC [40] defines a 
set of descriptive elements for the representation and exchange of bibliographic 
data. In the cultural domain, the Aquarelle Project [31] uses the SGML Cl DTD 
(Document Type Definition) of the French Ministry of Culture [10] to describe a set of 
element names, dedicated to territory inventory making. All the above metadata 
element sets are the result of the collaboration of a number of user communities 
and other authorities in the corresponding fields. Our approach can be consid- 
ered as a methodology to provide such metadata element sets by using existing 
semantic components of the domain of interest, namely ontologies and thesaurus 
hierarchies. 

Besides specific metadata element sets, ontologies have been developed and 
used in several projects to structure and access Web knowledge. The OntoSeek [18] 
system is used for gathering and organizing Web source descriptions. It exploits 
the SENSUS Ontology [24] which is based on the WordNet [32] linguistic on- 
tology to describe source contents. The WebKB set of tools [29] builds on a 
terminological ontology and conceptual graphs to represent (and index) docu- 
ments. Our approach can be considered complementary to the above systems, 
in the sense that not only do we provide a methodology to define ontologies 
enriched with thesaurus hierarchies, but also the choice of RDF as the represen- 
tation language enables their exchange in a machine readable format. 

Besides structuring and representing Web data, metadata schemas (referred 
to as domain models) are also used in mediation based systems such as Infor- 
mation Manifold [4], SIMS [9] and Carnot [11]. They provide a uniform view of 
information in a domain of discourse and are used to describe the contents of 
different information sources. For example, Carnot is an information integration 
system that relies on the CYC [28] knowledge base for describing source contents. 
The CYC knowledge base is a formalized representation of a “vast quantity of 
fundamental human knowledge” and contains about $10^5$ general concepts and $10^6$ 
assertions on these concepts. We do not aim at providing a mediation system 
but rather a methodology to define mediator domain models. The interesting 




Integrating Ontologies and Thesauri to Build RDF Schemas 237 



issue is that in some cases, underlying sources might use thesaurus hierarchies 
that could be integrated in the mediator to produce expressive domain models. 
Our approach can be considered as a first step towards this integration, which 
is a requirement for the next generation information systems [35,33]. 

Integrating ontologies and thesauri can be considered as a schema integration 
problem [5]. An important issue in this field concerns the coherent integration 
of database schemas with overlapping concepts, roles and data. Our approach is 
simpler since ontologies and thesauri can be considered as orthogonal ways 
of describing information. First, ontologies capture more general semantics than 
thesauri, and consequently we consider thesaurus terms to be specializations 
of ontology concepts. Second, thesauri incorporate only a fixed set of semantic 
relationships defined independently of any application or domain. Finally, the 
consistency of the resulting metaschema is not based on actual data, but on the 
meaning of ontologies and thesauri as perceived by experts in the domain. 

3 Resource Description Framework 

The Resource Description Framework (RDF) is a foundation for processing meta- 
data [34,6] which supports standard mechanisms for the representation of meta- 
data schemas as well as source descriptions. It relies on a simple, graph-based 
data model and uses XML (Extensible Markup Language) [42] to communicate 
and process metadata in a machine readable and human understandable format. 
Similar to the separation of schema and instance in traditional databases, we can 
distinguish between RDF descriptions and RDF schemas, the former considered 
as instances of the latter. 

3.1 RDF descriptions 

RDF can be used to describe any kind of resource [26] that is identified by a 
URI (Uniform Resource Identifier), such as a Web server, an XML document 
or an element of an HTML page (e.g. an image). RDF supports the definition 
of resource properties whose values can be other resources or literals (strings, 
integers). A collection of property/value pairs that refers to a specific resource is 
called an RDF description and can be represented as a labeled directed graph 
where nodes correspond to resources or literals (values) and edges to resource 
properties. 

Figure 1 shows an RDF description for a Web page that describes a painting 
by the French painter Claude Monet. RDF uses the XML namespace mechanism to 
distinguish among the different RDF schemas (Section 3.2) used in RDF 
descriptions. For example, lines 2 and 3 define two XML namespaces, where the 
first (web-page) contains general properties of HTML pages (title, presents, 
creator) and the second (artifact) specifies properties of cultural artifacts 
(title, style, type, period). This mechanism is very important since it 
permits the reuse of existing, distinct RDF schemas within the same RDF 
description without creating naming conflicts (e.g. web-page:title, 
artifact:title). Line 4 tells us that the description that follows concerns 
the HTML page which can be accessed by the URL 
http://metalab.unc.edu/louvre/paint/monet/first/impression/. The title of this 
page is “Web Museum: Monet, Claude: Impression: soleil levant” (line 5) and 
the page has been created by Nicolas Pioch (line 14). To describe properties 
of the painting, it is necessary to define a local resource, identified by the 
URI soleil_levant, that refers to the painting. The painting’s properties are 
its type (oil painting, line 8), title (Impression: soleil levant, line 9), 
style (impressionism, line 10) and period (first-impressionism, line 11). 



 1. <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
 2.     xmlns:web-page="http://metalab.unc.edu/louvre/namespaces/web-pages" 
 3.     xmlns:artifact="http://metalab.unc.edu/louvre/namespaces/artifacts"> 
 4.   <rdf:Description 
        about="http://metalab.unc.edu/louvre/paint/monet/first/impression"> 
 5.     <web-page:title>Web Museum: Monet, Claude: Impression: soleil levant 
        </web-page:title> 
 6.     <web-page:presents> 
 7.       <rdf:Description about="soleil_levant"> 
 8.         <artifact:type>oil painting</artifact:type> 
 9.         <artifact:title>Impression: soleil levant</artifact:title> 
10.         <artifact:style>impressionism</artifact:style> 
11.         <artifact:period>first-impressionism</artifact:period> 
12.       </rdf:Description> 
13.     </web-page:presents> 
14.     <web-page:creator>Nicolas Pioch</web-page:creator> 
15.   </rdf:Description> 
16. </rdf:RDF> 



Fig. 1. An RDF description for resource 
http://metalab.unc.edu/louvre/paint/monet/first/impression. 
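
The graph reading of this description can be made concrete with a small 
sketch. It stores the description as (resource, property, value) triples, one 
per edge of the labeled directed graph. The triple encoding and the helper 
properties_of are illustrative assumptions and not part of RDF itself; 
property names are written with their namespace prefix for readability. 

```python
# (subject, property, value) triples for the RDF description discussed above.
PAGE = "http://metalab.unc.edu/louvre/paint/monet/first/impression"
triples = [
    (PAGE, "web-page:title",
     "Web Museum: Monet, Claude: Impression: soleil levant"),
    (PAGE, "web-page:presents", "soleil_levant"),
    (PAGE, "web-page:creator", "Nicolas Pioch"),
    ("soleil_levant", "artifact:type", "oil painting"),
    ("soleil_levant", "artifact:title", "Impression: soleil levant"),
    ("soleil_levant", "artifact:style", "impressionism"),
    ("soleil_levant", "artifact:period", "first-impressionism"),
]


def properties_of(resource, triples):
    """Return the property/value pairs (outgoing edges) of one resource."""
    return {prop: value for subj, prop, value in triples if subj == resource}


painting = properties_of("soleil_levant", triples)
print(painting["artifact:title"])
print(properties_of(PAGE, triples)["web-page:title"])
```

Because the two title properties carry different prefixes, the page title and 
the painting title do not clash, which is the point of the namespace mechanism. 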



3.2 RDF Schemas 

The RDF Schema Specification Language [7] is a declarative language used for 
the definition of RDF schemas¹, incorporating aspects from knowledge 
representation models (e.g. semantic nets), database schema definition 
languages and graph models. It is a simple language of restricted expressive 
power compared to predicate calculus based specification languages such as 
CycL [28] and KIF [23]. 

¹ In the following, RDF Schema will denote the specification language used to 
define RDF schemas. 

An RDF schema defines classes and properties which can be instantiated 
in RDF descriptions. More specifically, an RDF schema is comprised of (1) a 
vocabulary, i.e. a set of class and property names to describe information in 
a domain (for example, the creator and title elements of the Dublin Core 
metadata element set), and (2) a set of semantic relationships to structure 
this information. Classes are organized in hierarchies using the property 
rdfs:subClassOf, which is defined in RDF Schema (namespace rdfs) and has the 
standard semantics of the inheritance relationship in object-oriented data 
models. For example, the RDF schema illustrated in Figure 2 defines class Man 
Made Object (line 4) and its subclass Iconographic Object (lines 5, 6). It 
also defines classes Style (line 7) and Period (line 8). RDF Schema allows 
both typed and untyped properties. Properties in our example are typed (i.e. 
they have a restricted domain and range). In Figure 2, property period (line 
15) is defined between classes Man Made Object (line 16) and Period (line 17), 
using the RDF Schema properties rdfs:domain and rdfs:range respectively. 

Summarizing, RDF offers a rich, comparatively simple graph-based data 
model and supports the definition of source specific metadata (RDF descrip- 
tions) and metadata schemata (RDF schemas). It uses XML for the syntactical 
representation, exchange, and processing of these metadata. 

 1. <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
 2.     xmlns:rdfs="http://www.w3.org/TR/1999/PR-rdf-schema-19990303#" 
 3.     xmlns:artifact=""> 
 4.   <rdfs:Class rdf:ID="Man Made Object"></rdfs:Class> 
 5.   <rdfs:Class rdf:ID="Iconographic Object"> 
 6.     <rdfs:subClassOf rdf:resource="#Man Made Object"/></rdfs:Class> 
 7.   <rdfs:Class rdf:ID="Style"></rdfs:Class> 
 8.   <rdfs:Class rdf:ID="Period"></rdfs:Class> 
 9.   <rdf:Property rdf:ID="style"> 
10.     <rdfs:domain rdf:resource="#Iconographic Object"/> 
11.     <rdfs:range rdf:resource="#Style"/></rdf:Property> 
12.   <rdf:Property rdf:ID="title"> 
13.     <rdfs:domain rdf:resource="#Man Made Object"/> 
14.     <rdfs:range rdf:resource="#rdfs:Literal"/></rdf:Property> 
15.   <rdf:Property rdf:ID="period"> 
16.     <rdfs:domain rdf:resource="#Man Made Object"/> 
17.     <rdfs:range rdf:resource="#Period"/></rdf:Property> 
18. </rdf:RDF> 



Fig. 2. An RDF schema for describing cultural resources. 
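
The effect of typed properties combined with class inheritance can be 
illustrated with a short sketch. It encodes the classes and properties of 
Figure 2 as plain dictionaries and lists the properties applicable to a class, 
including those inherited from its superclasses. The dictionary encoding and 
the function names are assumptions made for illustration; they are not part of 
RDF Schema. 

```python
# Class hierarchy and typed properties of Figure 2.
SUBCLASS_OF = {"Iconographic Object": "Man Made Object"}  # class -> superclass
PROPERTIES = {            # property -> (domain, range)
    "style": ("Iconographic Object", "Style"),
    "title": ("Man Made Object", "Literal"),
    "period": ("Man Made Object", "Period"),
}


def is_subclass(cls, ancestor):
    """True if cls equals ancestor or inherits from it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def applicable_properties(cls):
    """Properties whose domain is cls or one of its superclasses."""
    return sorted(p for p, (domain, _) in PROPERTIES.items()
                  if is_subclass(cls, domain))


print(applicable_properties("Iconographic Object"))
print(applicable_properties("Man Made Object"))
```

An Iconographic Object can thus carry title and period in addition to style, 
whereas a plain Man Made Object cannot carry style. 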



4 Ontologies 

The term ontology has been used in several disciplines, from philosophy to 
knowledge engineering, where an ontology is considered as a computational 
entity containing concepts and their properties, relationships between 
concepts, and constraints. Ontologies are defined independently of the actual 
data [16], reflect a common understanding of the semantics of the domain of 
discourse and are used to share and exchange semantic information between 
sources [15,33]. They are declarative specifications of the basic concepts and 
roles in an application domain. We only consider ontologies with inheritance 
relations (isa) and typed roles between concepts, which is sufficient to model 
a large class of ontologies [17] that can be easily represented as RDF schemas 
(Section 3.2). 

Definition 1. An ontology is a triple $O = (C, R, \mathit{isa})$ defined as follows: 

1. $C = \{c_1, c_2, \ldots, c_n\}$ is a set of concepts, where each concept 
$c_i$ refers to a set of real world objects (concept instances), 

2. $R = \{r_1, r_2, \ldots, r_m\}$ is a set of binary typed roles between concepts, 

3. $\mathit{isa}$ is a set of inheritance relationships defined between 
concepts. Inheritance relationships carry subset semantics and define a 
partial order over concepts. 

Ontologies can be represented as directed graphs where nodes correspond to 
concepts and arcs correspond to roles and isa relationships. Figure 3 
illustrates an example ontology, inspired by the ICOM/CIDOC Reference Model 
[19], which is used to describe cultural information. Concept Physical Object 
collects all physical objects, which may be composed of other physical 
objects. Activities (concept Activity) are associated with physical objects 
and are performed by persons, institutions and organizations (concept Actor). 
Concepts Biological Object and Man-Made Object are sub-concepts of Physical 
Object and inherit all roles defined in their super-concept. Instances of 
Man-Made Object have a title (role has-title) and have been created in a 
specific period (role of-period). Iconographic Object is a sub-concept of 
Man-Made Object. Iconographic objects have a style (role style) which is an 
instance of concept Style. 
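
Definition 1 and the example ontology can be sketched as a small data 
structure. The sketch makes two simplifying assumptions that are not in the 
definition: each concept has at most one direct super-concept, and a role is 
stored as a name with a (domain, range) pair. The class name Ontology, the 
role name performed-by and the method names are hypothetical. 

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    concepts: set = field(default_factory=set)
    roles: dict = field(default_factory=dict)  # role name -> (domain, range)
    isa: dict = field(default_factory=dict)    # concept -> direct super-concept

    def ancestors(self, concept):
        """Super-concepts of a concept, nearest first."""
        result = []
        while concept in self.isa:
            concept = self.isa[concept]
            result.append(concept)
        return result

    def roles_of(self, concept):
        """Roles defined on the concept or inherited from its super-concepts."""
        owners = [concept] + self.ancestors(concept)
        return sorted(r for r, (domain, _) in self.roles.items()
                      if domain in owners)


onto = Ontology(
    concepts={"Physical Object", "Biological Object", "Man-Made Object",
              "Iconographic Object", "Activity", "Actor",
              "Period", "Title", "Style"},
    roles={"is-composed-of": ("Physical Object", "Physical Object"),
           "performed-by": ("Activity", "Actor"),
           "has-title": ("Man-Made Object", "Title"),
           "of-period": ("Man-Made Object", "Period"),
           "style": ("Iconographic Object", "Style")},
    isa={"Biological Object": "Physical Object",
         "Man-Made Object": "Physical Object",
         "Iconographic Object": "Man-Made Object"},
)
print(onto.roles_of("Iconographic Object"))
```

The call lists the role style together with the roles inherited from Man-Made 
Object and Physical Object. 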



[Figure 3: graph with concepts Activity, Physical Object, Biological Object, 
Man-Made Object, Iconographic Object, Period, Title and Style, and roles 
is-composed-of, performed by, of-period, has-title and style.] 

Fig. 3. A simple cultural ontology. 
5 Thesauri : Structured Vocabularies 

A vocabulary is a collection of terms that describe information in a domain of 
interest. Examples of such vocabularies are the ACM Computing Classification 
System [2], the Library of Congress Subject Headings [27], the Unified Medical 
Language System [39] for medicine and the Art & Architecture Thesaurus [38,1] 
for the cultural domain. Thesauri are structured vocabularies of thousands of 
terms which have been, and are still being, used as an efficient means for 
consistent indexing and retrieval of information [14]. 

Thesaurus terms are considered as the “representation of concepts in the 
form of a noun or a noun phrase” [20]. Concepts are perceived by thesaurus 
developers as referring collectively to a set of objects (concept instances) [30] 
that are considered as such not with respect to a formal classification process 
but through a common agreement. Under this perspective, the interpretation of 
a thesaurus term is a set of objects, which we will call the extension of the term. 
Thesauri are said to be structured since they include a fixed set of semantic 
term relationships. Due to the set theoretic definition of terms, these semantic 
relationships are interpreted as relations between sets [13,21,36]. 

The ISO 2788 Standard [20] for the documentation and establishment of 
monolingual thesauri defines the following five kinds of term relationships, 
which distinguish structured thesauri from arbitrary collections of terms: 

1. generalization (broader term generic - btg), 

2. instance (broader term - bt), 

3. partitive or part-of (broader term partitive - btp), 

4. associative (related term - rt) and 

5. equivalence (used for term - uf). 

Term relationships btg and btp are called hierarchical. In this paper we are 
only concerned with btg relationships², which carry subset semantics and are 
the most frequently used hierarchical relationships. btg-relationships are 
transitive and organize terms with similar semantics into directed acyclic 
graphs (DAG), referred to as hierarchies or classification schemes. Two 
examples of btg-hierarchies are shown in Figures 4 and 5. For example, in 
Figure 4, term paintings is a broader term of oil paintings, with the 
interpretation that all objects that belong to the extension of the latter 
also belong to the extension of the former. A hierarchy is defined by its root 
term, a term with no broader term (<visual works> in Figure 4). We only assume 
mono-hierarchical thesauri, i.e. each term other than the root has exactly one 
broader term. 

In the following, we will consider a thesaurus as a set of hierarchies, organized 
using the hierarchical btg-relationship. Although the definition we give is not 
complete w.r.t. all possible term relations existing in real thesauri it is sufficient 
for creating rich metadata schemata. 

² The interested reader can refer to ISO 2788 [20] for a deeper presentation 
of the remaining term relationships. 

[Figure 4: tree of Visual Works terms (including miniatures); edges denote 
broader term generic relationships.] 

Fig. 4. Part of the Art & Architecture Thesaurus hierarchy Visual Works which collects 
all artifacts that are used for visual communication (paintings, sculptures, photos). 

Fig. 5. Part of the Art & Architecture Thesaurus hierarchy Styles & Periods which 
collects all styles, periods and movements of Art in the western world. 

Definition 2. A thesaurus is a couple $T = (D, btg)$ such that 

1. $D = \{t_1, t_2, \ldots, t_n\}$ is a set of terms, 

2. btg is a binary relationship between terms such that for each pair of terms 
$(t, t')$ there exists at most one btg-path between $t$ and $t'$ 
(mono-hierarchical thesaurus). 
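
A mono-hierarchical thesaurus in the sense of Definition 2 can be sketched as 
a mapping from each term to its single broader term. The toy hierarchy below 
is loosely based on the terms mentioned for Visual Works; the direct links 
between terms are assumptions, since guide terms and intermediate levels of 
the real AAT hierarchy are left out. The function names are hypothetical. 

```python
# Mono-hierarchical thesaurus: each term maps to its single broader term (btg).
# The root of the hierarchy has no entry.
BTG = {
    "paintings": "<visual works>",       # assumed direct link
    "oil paintings": "paintings",
    "sculpture": "<visual works>",       # assumed direct link
}


def broader_terms(term, btg):
    """All terms reachable from `term` by following btg links (transitivity)."""
    result = []
    while term in btg:
        term = btg[term]
        result.append(term)
    return result


def root_of(term, btg):
    """The root term of the hierarchy that contains `term`."""
    chain = broader_terms(term, btg)
    return chain[-1] if chain else term


print(broader_terms("oil paintings", BTG))
print(root_of("sculpture", BTG))
```

Since every term has at most one broader term, there is at most one btg-path 
between any two terms, as the definition requires. 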

6 Creating RDF Schemas from Ontologies and Thesauri 

In this section, we present a methodology for the construction of RDF schemas 
based on the integration of ontologies and thesaurus hierarchies. The construc- 
tion of an RDF schema is done in three steps. In a first step, we specify for 
each ontology concept, a set of terms, the latter considered as sub-concepts of 
the former. This step is similar to establishing inter-schema assertions [8,11] 
for database schema integration and cannot be a completely automated pro- 
cess since it requires the knowledge of the thesaurus and ontology semantics. 
Nevertheless, we must note that this knowledge could be partially derived from 
source data when thesauri are used to index concept instances. In a second, in- 
termediate step, we extract automatically for each concept a concept thesaurus. 
This thesaurus contains only the terms connected to this concept by the 
connection relation, along with broader-generic relationships derived from the initial 
thesaurus. This process can be done automatically and does not require the 
knowledge of the ontology. In the final step we integrate these thesauri with the 
ontology to produce an RDF schema consisting of (1) a structural view provided 
by the ontology, (2) connection relations between concepts and terms, and (3) 
thesaurus hierarchies. 
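
The second step can be sketched as follows. The sketch assumes that the 
connection relation is a set of (term, concept) pairs, that the thesaurus is a 
term-to-broader-term mapping, and that a derived broader-generic relationship 
links a selected term to its nearest selected broader term. This last reading 
is an interpretation, not a definition taken from the text. The function name 
and the intermediate term <impressionist phases> are invented for the example. 

```python
def concept_thesaurus(concept, con, btg):
    """Terms connected to `concept`, each linked to its nearest connected
    broader term (assumed reading of 'derived' btg relationships)."""
    selected = {term for term, c in con if c == concept}
    induced = {}
    for term in selected:
        parent = btg.get(term)
        while parent is not None and parent not in selected:
            parent = btg.get(parent)  # skip broader terms that are not selected
        if parent is not None:
            induced[term] = parent
    return selected, induced


# Hypothetical fragment of a Styles & Periods hierarchy.
btg = {
    "first-impressionism": "<impressionist phases>",
    "<impressionist phases>": "impressionism",
    "impressionism": "Styles & Periods",
}
con = {
    ("impressionism", "Style"),
    ("first-impressionism", "Style"),
    ("first-impressionism", "Period"),
}
print(concept_thesaurus("Style", con, btg))
print(concept_thesaurus("Period", con, btg))
```

The function needs only the connection relation and the initial thesaurus, 
which is consistent with the remark that this step does not require knowledge 
of the ontology. 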

Observe that the intermediate step makes it possible to construct the resulting 
schema incrementally by extracting on demand concept thesauri that correspond 
to different ontology concepts. Moreover, another benefit of the separation of the 
integration into several steps is that we are able to monitor the result at any level 
of the integration process. Last, it is important to mention that our methodology 
is not related to a specific implementation platform. 

6.1 Step 1 : Specialization of Concepts with Terms 

In the first step of the integration process, thesaurus terms are “connected” 
to ontology concepts. These connections have inclusion semantics and are 
represented by a binary connection relation $Con \subseteq T \times C$ over a 
set of thesaurus terms $T$ and a set of ontology concepts $C$. An example of a 
connection relation is presented in Figure 6. Terms impressionism, 
post-impressionism and abstract impressionism of the Art & Architecture 
Thesaurus hierarchy Styles & Periods (Figure 5) describe specific styles 
(ontology concept Style in Figure 3). Term first-impressionism of the same 
hierarchy describes both a style and a period (concepts Style and Period 
respectively). Similarly, term renaissance and its narrower terms in the 
Styles & Periods hierarchy describe different types of styles and periods 
(ontology concepts Style and Period respectively). Finally, terms paintings, 
oil paintings and sculpture of the AAT hierarchy Visual Works (Figure 4) 
define different kinds of iconographic objects (ontology concept Iconographic 
Object). 
Fig. 6. A connection relation Con for AAT hierarchies Styles & Periods, Visual Works 
and ontology concepts Style, Period and Iconographic Object. 



In the following we will say that, if term t is connected to concept c, it is 
labeled by c. The way the user actually labels terms (chooses the concepts to 
be connected to a given term) will not be discussed in this paper for lack of 
space. Briefly, either one labels t with some concept c, with the assumption 
that all descendants of t are also connected to (labeled with) c, or one 
chooses explicitly among the descendants of t. 
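
The two labeling options can be sketched as follows. The sketch assumes that 
Con is a set of (term, concept) pairs and that the thesaurus is a 
term-to-broader-term mapping; the toy hierarchy and all function names are 
hypothetical. 

```python
def narrower_terms(term, btg):
    """All descendants of `term`: terms whose chain of broader terms reaches it."""
    found = []
    for t in btg:
        parent = btg[t]
        while parent is not None:
            if parent == term:
                found.append(t)
                break
            parent = btg.get(parent)
    return found


def label_subtree(term, concept, btg, con):
    """Option 1: connect `term` and all of its descendants to `concept`."""
    con.add((term, concept))
    con.update((t, concept) for t in narrower_terms(term, btg))


def label_selected(terms, concept, con):
    """Option 2: connect only the explicitly chosen terms to `concept`."""
    con.update((t, concept) for t in terms)


# Toy Visual Works fragment; the direct links are assumptions.
BTG = {
    "paintings": "<visual works>",
    "oil paintings": "paintings",
    "sculpture": "<visual works>",
}
con = set()
label_subtree("paintings", "Iconographic Object", BTG, con)
label_selected(["sculpture"], "Iconographic Object", con)
print(sorted(con))
```

Labeling the subtree of paintings and then picking sculpture explicitly leaves 
the root term <visual works> unconnected. 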

In the previous example, we do not connect the whole thesaurus hierarchy 
Styles & Periods to concepts Style and Period. We adopt this selective ap- 
proach, i.e. relating thesaurus terms to ontology concepts explicitly, for several 
reasons. An obvious reason is that some terms could be out of the scope of the 
application that has to be described by the resulting RDF schema. For exam- 
ple, if some application is only concerned with paintings, then terms referring 
to artifacts other than paintings (e.g. sculpture, drawings) need not be consid- 
ered in the resulting schema. Another reason is that some terms (e.g. guide 
terms in [20,38]) are used to organize thesaurus hierarchies (e.g. <visual works 
by medium or technique>) and might have no use for describing information. 
Finally, another important reason is that thesaurus hierarchies might contain 
terms which can be connected to different concepts. For example, terms of the 
AAT hierarchy Styles & Periods (Figure 5) describe styles (e.g. impressionism), 
periods (e.g. art deco), or both styles and periods (e.g. renaissance). Connecting 
terms to concepts in a selective manner allows users to distinguish between the 
multiple meanings of a term (e.g. as in the case of homonyms) and consequently 
to resolve semantic ambiguities at the thesaurus level. 



6.2 Step 2 : Thesaurus Extraction 

After having defined the connection relations between terms and concepts, we 
extract for each concept in the connection relation, a thesaurus, called concept 
thesaurus. This is done in two steps. First, each term in thesaurus $\mathcal{T}$ is labeled 
by the concepts to which it is connected in the relation Con. Observe that a 
term can be connected to several concepts, i.e. term labels are sets of concept 
names. For example, in the connection relation illustrated in Figure 6, term first- 
impressionism is connected to both Style and Period concepts. In this case, the 
label of term first-impressionism is the set of concepts {Style, Period}. 
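
The labeling step can be illustrated with a minimal Python sketch. The representation is an assumption made for illustration only: the connection relation is held as a set of (term, concept) pairs, and the names `CON`, `TERMS` and `label_terms` are hypothetical, not part of the paper.

```python
# Hypothetical excerpt of the connection relation Con as (term, concept) pairs.
CON = {
    ("impressionism", "Style"),
    ("first-impressionism", "Style"),
    ("first-impressionism", "Period"),
    ("paintings", "Iconographic Object"),
}


def label_terms(terms, con):
    """Map each term to the set of concepts it is connected to in `con`."""
    labels = {t: set() for t in terms}  # unconnected terms keep an empty label
    for term, concept in con:
        if term in labels:
            labels[term].add(concept)
    return labels


TERMS = ["impressionism", "first-impressionism", "paintings", "drawings"]
LABELS = label_terms(TERMS, CON)
```

A term such as drawings, which does not occur in the relation, simply receives the empty label and therefore drops out of every extracted thesaurus.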

Second, we define a selection operation $\sigma$ that constructs from a labeled 
thesaurus $\mathcal{T}_\lambda$ and a set of concept names $S$ a new labeled thesaurus that contains 

(1) the set of terms in $\mathcal{T}_\lambda$ whose labels contain at least one concept in $S$ and 

(2) btg relations between these terms, induced by the btg relations in the initial 
thesaurus. More precisely : 

Definition 3. Let $\mathcal{T}_\lambda = (D, btg)$ be a labeled thesaurus where each term is labeled 
by a (possibly empty) set of concepts : $\lambda : D \rightarrow 2^C$. Let $S$ be a set of concept 
names. The selection $\sigma(S, \mathcal{T}_\lambda)$ creates a new thesaurus as follows : 

1. keep all terms $t$ of $\mathcal{T}_\lambda$ which are labeled by at least one concept name in $S$ 
($S \cap \lambda(t) \neq \emptyset$), 

2. create btg relations between all terms $t$ and $t'$ in $\sigma(S, \mathcal{T}_\lambda)$ which are related 
by a btg path in $\mathcal{T}_\lambda$ that contains no other term in $\sigma(S, \mathcal{T}_\lambda)$. 

A naive algorithm calculating $\sigma(S, \mathcal{T}_\lambda)$ is shown in Appendix A. Since $\sigma(S, \mathcal{T}_\lambda)$ 
can be evaluated without knowledge of the underlying ontology, this algorithm 
can be executed on the thesaurus site without accessing information from 
the ontology. This property is useful in a distributed environment where thesauri 
and ontologies might be stored on different sites. 
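
As an illustration of Definition 3, the following Python sketch computes the selection directly from the definition rather than by the procedure of Appendix A. It assumes that btg is given as a set of (narrower, broader) pairs and that labels are sets of concept names; the function name `select` and the four-term example are hypothetical and only loosely modeled on the situation discussed for Figure 7.

```python
def select(terms, btg, labels, concepts):
    """Sketch of the selection: return (kept terms, induced btg pairs)."""
    wanted = set(concepts)
    # Step 1: keep the terms whose label shares at least one concept with S.
    kept = {t for t in terms if labels.get(t, set()) & wanted}

    broader = {}
    for narrow, broad in btg:
        broader.setdefault(narrow, set()).add(broad)

    # Step 2: link each kept term to the nearest kept terms above it,
    # skipping over broader terms that were dropped.
    new_btg = set()
    for t in kept:
        stack = list(broader.get(t, ()))
        seen = set()
        while stack:
            b = stack.pop()
            if b in seen:
                continue
            seen.add(b)
            if b in kept:
                new_btg.add((t, b))  # path holds no other kept term
            else:
                stack.extend(broader.get(b, ()))
    return kept, new_btg


# Hypothetical thesaurus: w below u below t, and v below t; u is unlabeled.
TERMS = {"t", "u", "v", "w"}
BTG = {("u", "t"), ("w", "u"), ("v", "t")}
LABELS = {"t": {"c"}, "u": set(), "v": {"d"}, "w": {"e"}}
KEPT, NEW_BTG = select(TERMS, BTG, LABELS, {"c", "d", "e"})
```

In this example the unlabeled term u disappears and w becomes directly related to t. Only the thesaurus and its labels are consulted, which matches the observation that the selection needs no access to the ontology.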

Using this selection operation it is possible to define a labeled thesaurus $\mathcal{T}_\lambda^c$ 
for each concept c in the set of ontology concepts C as follows : 

Definition 4. Let $\mathcal{T}_\lambda = (D, btg)$ be a labeled thesaurus. Let c be a concept and 
$S_c$ be the set of sub-concepts of c in C including c. Then, we can define a labeled 
concept thesaurus $\mathcal{T}_\lambda^c = \sigma(S_c, \mathcal{T}_\lambda)$ which contains all terms connected (in Con) 
to c or a sub-concept of c. 

For the definition of a concept thesaurus we exploit not only the btg relations 
between the terms but also the isa relationships at the ontology level. Consider 
the example in Figure 7. Term v is labeled by concept d, term t by c and w 
by e. The selection operation on concept c will construct the thesaurus $\mathcal{T}_\lambda^c$ that 
contains besides term t, terms v and w that are labeled by its sub-concepts. 
Observe also that a term can appear in multiple concept thesauri, and terms 
that are not labeled by any concept have disappeared from the concept thesauri. 
For example, term u is not connected to any concept and has disappeared from 
the concept thesauri in Figure 7. Moreover, the selection operation on concept 
c created a btg relation between terms w and t which were not directly related 
in the original thesaurus. 
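
The role of the isa relationships in Definition 4 can be sketched in Python as follows. The sketch only computes the set of sub-concepts $S_c$ and the terms that end up in the concept thesaurus; it assumes isa is a set of (sub-concept, super-concept) pairs, and the concrete isa pairs and all function names are hypothetical.

```python
def sub_concepts(c, isa):
    """Return c together with all its direct and indirect sub-concepts."""
    result = {c}
    frontier = [c]
    while frontier:
        sup = frontier.pop()
        for sub, parent in isa:
            if parent == sup and sub not in result:
                result.add(sub)
                frontier.append(sub)
    return result


def concept_terms(c, isa, labels):
    """Terms labeled by c or by one of its sub-concepts."""
    s_c = sub_concepts(c, isa)
    return {t for t, label in labels.items() if label & s_c}


# Assumed for illustration: d and e are sub-concepts of c.
ISA = {("d", "c"), ("e", "c")}
LABELS = {"t": {"c"}, "u": set(), "v": {"d"}, "w": {"e"}}
```

With these assumptions the concept thesaurus of c holds t, v and w, while that of d holds only v, so a term can indeed appear in several concept thesauri.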

Each concept thesaurus can be extracted independently and contains only 
a subset of the terms defined in the connection relation. This means that the 
size of the extracted thesauri is bounded by the number of terms in the connection 
relation and is independent of the size of the original thesaurus. 

Fig. 7. Extracted Thesaurus Examples. 

At this point, we should mention that a concept thesaurus can be induced 
by those of its super-concepts. For example, if d is a sub-concept of c, and $\mathcal{T}_\lambda^c$ 
is the concept thesaurus of c, then the concept thesaurus of d can be extracted 
as follows : $\mathcal{T}_\lambda^d = \sigma(S_d, \mathcal{T}_\lambda^c)$. In the previous example, only concept thesaurus 
$\mathcal{T}_\lambda^c$ of concept c has to be extracted from the original thesaurus $\mathcal{T}_\lambda$. All thesauri 
corresponding to sub-concepts of c might then be created on demand during the 
creation of the RDF schema (Section 6.3). 



6.3 Creation of the RDF Schema 

In this section we will present how the RDF schema is constructed out of a set of 
concept thesauri. This schema will incorporate the set of ontology concepts and 
roles, the concept thesauri defined for each ontology concept and connections 
between terms and concepts. In short, ontology concepts and thesaurus terms 
are modeled as RDF classes, ontology roles as RDF properties. Ontology isa 
relationships, connection relations between terms and concepts and btg relations 
between terms all carry inclusion semantics and are modeled with the RDF 
subclassOf property. 

The creation of the RDF schema S for an ontology $O = (C, R, isa)$, and a 
set of concept thesauri $\mathcal{T}_\lambda^c = (D, btg)$ is straightforward : 

1. The set of RDF classes in S is obtained as follows: 

a) for each ontology concept c define RDF class c, 

b) for each term t in $\mathcal{T}_\lambda^c$ define RDF class c:t. 

2. The set of RDF properties is obtained by defining for each typed role r(c, d) 
in R an RDF property with domain RDF class c and range RDF class d. 

3. The set of RDF subclassOf properties is obtained as follows: 

a) for each isa(c, d) relationship between ontology concepts c and d define 
an RDF subclassOf property between RDF classes c and d. 

b) for each RDF class c:t, corresponding to a root term in thesaurus $\mathcal{T}_\lambda^c$, 
add an RDF subclassOf property between RDF classes c:t and c. 

c) for each btg relation between two terms t and t' in a concept thesaurus 
$\mathcal{T}_\lambda^c$, define an RDF subclassOf property between RDF classes c:t and 
c:t'. 

It is interesting to note that we connect only the root term of each concept 
thesaurus to the corresponding concept. Due to the transitivity of the RDF 
subclassOf property, it can be inferred that a term t is a subclassOf another term 
t' or a concept c. 
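
The three construction rules can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' implementation: roles are given as a mapping from role name to a (domain, range) pair, each concept thesaurus as a (terms, btg) pair with btg holding (narrower, broader) pairs, and a root term is taken to be a term without a broader term in its concept thesaurus. Unlike Figure 8, the sketch always prefixes term classes with their concept. The name `build_schema` and the sample data are hypothetical.

```python
def build_schema(concepts, roles, isa, concept_thesauri):
    """Return (classes, properties, subclass_of) following rules 1 to 3."""
    classes = set(concepts)                       # rule 1a
    properties = dict(roles)                      # rule 2: name -> (domain, range)
    subclass_of = set(isa)                        # rule 3a
    for c, (terms, btg) in concept_thesauri.items():
        has_broader = {narrow for narrow, _ in btg}
        for t in terms:
            classes.add(f"{c}:{t}")               # rule 1b
            if t not in has_broader:              # root term of the thesaurus
                subclass_of.add((f"{c}:{t}", c))  # rule 3b
        for narrow, broad in btg:
            subclass_of.add((f"{c}:{narrow}", f"{c}:{broad}"))  # rule 3c
    return classes, properties, subclass_of


CLASSES, PROPERTIES, SUBCLASS_OF = build_schema(
    concepts={"Iconographic Object", "Style"},
    roles={"style": ("Iconographic Object", "Style")},
    isa=set(),
    concept_thesauri={
        "Iconographic Object": (
            {"paintings", "oil paintings"},
            {("oil paintings", "paintings")},
        ),
        "Style": ({"impressionism"}, set()),
    },
)
```

Only the root term paintings is attached to its concept; that oil paintings is also a subclass of Iconographic Object follows from the transitivity of subclassOf.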

The RDF schema illustrated in Figure 8 has been constructed from the on- 
tology in Figure 3, the thesaurus classification schemes in Figures 4, 5, and the 
connection relation in Figure 6. 

Ontology concepts Man Made Object, Iconographic Object, Style, Period 
and terms oil paintings, paintings, impressionism and first-impressionism are 
all represented as RDF classes (lines 5, 7, 9, 11, 21, 23, 25, 27). For simplification, we 
only prefix terms with the corresponding concept if they are contained in different 
concept thesauri. RDF class paintings is defined as a subclass of class 
Iconographic Object (line 22), since term paintings is the root term of the Iconographic 
Object concept thesaurus. In the same way, classes impressionism 
and first-impressionism are defined as subclasses of concepts Style and 
Period respectively (lines 26, 28). Class oil paintings is a subclass of class 
paintings (line 24), as defined by the btg relation between term oil paintings and 
term paintings. Ontology role style is defined as an RDF property, its domain 
being the class Iconographic Object (line 19) and its range class Style (line 
20). By definition of the subclassOf property, all subclasses of Iconographic 
Object inherit this property. 

Using this RDF schema, one can provide RDF descriptions about specific 
web resources. For example, a new RDF description for the source described 
in Figure 1 is shown in Figure 9. When comparing this new description with 
the previous one, one can observe that we have replaced namespace artifact 
by a new namespace int which corresponds to the RDF schema in Figure 8. 
In this RDF description, semantic information that was captured as a value in 
the previous description has been added at the schema level. For example, the 
fact that the described resource is an impressionist painting was encoded in the 
value of tag <artifact:style>. This value corresponds in fact to a term in 
the AAT and is represented as an instance of class int:impressionism (line 11) 
in the new schema. The same argument holds for the value first-impressionism 
which is now represented as an instance of RDF class int:first-impressionism 
(line 12). Observe also that the tag <rdf:Description> (Figure 1, line 7) has been 
replaced by a typed node tag <int:oil paintings> (line 8) indicating that 
the described resource is an oil painting. 
 1. <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
 2.          xmlns:rdfs="http://www.w3.org/TR/1999/PR-rdf-schema-19990303#" 
 3.          xmlns:int=" "> 
 4.   <rdfs:Class rdf:ID="Physical Object"></rdfs:Class> 
 5.   <rdfs:Class rdf:ID="Man-Made Object"> 
 6.     <rdfs:subclassOf rdf:resource="#Physical Object"/></rdfs:Class> 
 7.   <rdfs:Class rdf:ID="Iconographic Object"> 
 8.     <rdfs:subclassOf rdf:resource="#Man-Made Object"/></rdfs:Class> 
 9.   <rdfs:Class rdf:ID="Period"></rdfs:Class> 
10.   <rdfs:Class rdf:ID="Title"></rdfs:Class> 
11.   <rdfs:Class rdf:ID="Style"></rdfs:Class> 
12.   <rdf:Property rdf:ID="of-period"> 
13.     <rdfs:domain rdf:resource="#Man Made Object"/> 
14.     <rdfs:range rdf:resource="#Period"/></rdf:Property> 
15.   <rdf:Property rdf:ID="title"> 
16.     <rdfs:domain rdf:resource="#Man Made Object"/> 
17.     <rdfs:range rdf:resource="#Title"/></rdf:Property> 
18.   <rdf:Property rdf:ID="style"> 
19.     <rdfs:domain rdf:resource="#Iconographic Object"/> 
20.     <rdfs:range rdf:resource="#Style"/></rdf:Property> 
21.   <rdfs:Class rdf:ID="paintings"> 
22.     <rdfs:subclassOf rdf:resource="#Iconographic Object"/></rdfs:Class> 
23.   <rdfs:Class rdf:ID="oil paintings"> 
24.     <rdfs:subclassOf rdf:resource="#paintings"/></rdfs:Class> 
25.   <rdfs:Class rdf:ID="impressionism"> 
26.     <rdfs:subclassOf rdf:resource="#Style"/></rdfs:Class> 
27.   <rdfs:Class rdf:ID="first-impressionism"> 
28.     <rdfs:subclassOf rdf:resource="#Period"/></rdfs:Class> 
29. </rdf:RDF> 



Fig. 8. The RDF schema resulting from the integration of the ontology and thesaurus. 
 1. <rdf:RDF 
 2.   xmlns:web-page="http://metalab.unc.edu/louvre/namespaces/web-pages" 
 3.   xmlns:int="http://www.connectit.com/icom/aat"> 
 4.   <rdf:Description 
 5.     about="http://metalab.unc.edu/louvre/paint/monet/first/highway/"> 
 6.     <web-page:title>Web Museum: Monet, Claude : Impression : 
        soleil levant</web-page:title> 
 7.     <web-page:presents> 
 8.       <int:oil paintings 
 9.         about="http://metalab.unc.edu/louvre/paintings/monet/impression"> 
10.         <int:title>Impression : soleil levant</int:title> 
11.         <int:style><int:impressionism/></int:style> 
12.         <int:of-period><int:first-impressionism/> 
13.         </int:of-period> 
14.       </int:oil paintings> 
15.     </web-page:presents> 
16.     <web-page:creator>Nicolas Pioch</web-page:creator> 
17.   </rdf:Description> 
18. </rdf:RDF> 

Fig. 9. RDF description for Claude Monet painting using the integrated schema. 



7 Conclusions and Future Work 

In this paper, we have presented a modular, component-based approach to the 
construction of RDF schemas based on the integration of ontologies and thesaurus 
hierarchies. Our examples were taken from the cultural application domain; 
however, the presented approach can also be applied to other semantically 
rich scientific (e.g. medicine, biology or chemistry) or electronic commerce (e.g. 
electronic catalogue) applications. 

An interesting issue concerns the specification of the connection relation. 
Whereas this relation can be specified manually for a limited number of terms 
and concepts, its creation becomes cumbersome when the number of connected terms 
and concepts increases. There are two possible solutions to this problem. First, 
as already mentioned, thesaurus terms are used for indexing documents and 
other actual data. The existence of these terms at the data level might then be 
exploited for the automatic creation of the connection relation by using data 
mining techniques. The second solution consists in the definition of a query language 
for thesauri which allows sets of terms to be extracted by simple declarative 
queries (e.g. path expressions). In this way, the connection relation can be represented 
as a conceptual view on the thesaurus where concepts are defined by 
term queries. 

The resulting RDF schema can be perceived as the domain model in mediation-based 
systems such as Information Manifold and SIMS and it plays an 
essential role in achieving semantic interoperability between the sources. This domain 
model provides a uniform view of information in the domain of discourse. 
Users pose queries against this model and information sources export their content 
descriptions as views on this model. The source descriptions are used by 
mediators to identify the set of relevant information sources with respect to a 
user query. We intend to apply our approach to the definition of semantically 
rich domain models in the context of the Artemis project [3], which proposes 
a flexible framework for the definition of cultural mediators. In the same application 
context, we consider that an important issue is the exploitation of the 
related term (synonym) relation for query rewriting and concept subsumption. 
In this direction, we intend to study more thoroughly the integration of multilingual 
thesauri and the exploitation of inter-thesaurus relations for creating 
multilingual RDF schemas. Such schemas can be used effectively in the context 
of querying multilingual information sources. 

A first prototype is currently under implementation. We have already implemented 
a loader of RDF descriptions into the O2 object-oriented database 
management system of Ardent Software using the SiRPAC [37] RDF parser to 
parse RDF documents (RDF schemas and descriptions). The next step is to 
provide tools for specifying the connection relation and create concept thesauri. 
Finally, we want to provide a query interface based on OQL for selecting re- 
sources according to their descriptions. 
A A Naive Algorithm for Thesaurus Extraction 

Input :  (i1) set of concepts $S$ 
         (i2) a thesaurus $\mathcal{T} = (T, btg)$ 
         (i3) a labeling function $\lambda$ 
         (i4) a set of root terms $R$ in $\mathcal{T}$ 
Output : (o1) a thesaurus $\sigma(S, \mathcal{T}_\lambda)$ 

Procedure : 

    $T_S := \emptyset$; $btg_S := \emptyset$ 
    for all terms r in R 
        for all narrow terms t of r 
            create_btg(r, t) 

create_btg(r, t) : 
    if $\lambda(r) \cap S = \emptyset$ 
        u = t 
    else 
        add r to $T_S$ 
        if $\lambda(t) \cap S \neq \emptyset$ 
            add t to $T_S$ 
            add (t, r) to $btg_S$ 
            u = t 
        else 
            u = r 
    for all narrow terms t' of t 
        create_btg(u, t') 

Abstract. This paper introduces a novel approach to supporting digital library 
users in organising and annotating material. We have extended the concept of 
open hypermedia by introducing typed links, which support: addition of (user- 
defined) semantics to hypertexts, user navigation, and machine supported 
analysis and synthesis of hypermedia structures. The Webvise open hypermedia 
system is integrated with the World Wide Web, and has been augmented with a 
type system. We illustrate the potential use in the context of digital libraries 
with a scenario of teachers jointly preparing a course based on digital library 
material. 



1 Introduction 

Digital libraries may be viewed as a digital extension and enhancement of traditional 
libraries. The extensions and enhancements take place along several dimensions 
including the collections of the library, access to material, management of the library 
collection, and communication about the items in the collection [36]. Thus, a digital 
library can be seen as a collection of both transient and persistent documents that 
allows for work ranging from individual to collaborative work. Digital libraries may 
also be seen as a kind of hypermedia in the line of Bush’s visions on creating a 
machine (the Memex) to help humans handle scientific literature [6]. Hypermedia 
technology is viewed by many researchers as a potentially powerful technology for use 
in implementing digital library systems [12], [22], [30], [38]. The World Wide Web 
(WWW) [3] hypermedia system is thus playing an important role in the 
implementation of most digital library systems emerging today. 

This paper focuses on the application of hypermedia in the context of digital 
libraries. However, we aim at adding services to existing digital libraries rather than at 
constructing a hypermedia infrastructure for digital libraries per se. In doing this, we 
focus on approaches to applying open hypermedia to the problem of organising and 
annotating digital library material. Supporting digital library users in organising and 
annotating material found through querying of digital libraries is a problem domain 
that has been addressed by many digital library researchers [20], [25], [34]. This 
paper addresses the problem faced by digital library users when having to use digital 
information for specific tasks. 



1.1 The WWW and Digital Libraries 

The WWW is emerging as an infrastructure implementing part of the universally 
available body of information envisioned by the early hypertext pioneers. 
Nevertheless, the WWW still suffers from a number of problems. These include that: 

• Users need to own pages in order to link from them, 

• making new anchors in documents requires these documents to be changed, 

• collaboration on live documents is not supported, and 

• no other structures than documents and links exist. 

Adding to this the sheer size of the WWW, the classical problem of "being lost in 
Hyperspace" [8] is magnified. This means that the WWW per se is insufficient to 
support dynamic annotation and organisation of digital library material. 



1.2 Open Hypermedia for the WWW 

The work presented in this paper builds upon approaches to integrating the WWW and 
open hypermedia [15], [16]. The Webvise open hypermedia service [15] provides 
structures such as contexts, links, annotations, and guided tours, stored in hypermedia 
databases external to the Web pages. In this way, Webvise supports users in creating 
links from parts of Web pages they do not own, and to parts of Web pages without 
writing HyperText Markup Language (HTML) [21] target tags. The method for 
locating parts of Web pages can locate parts of pages across frame hierarchies, and it 
also supports certain repairs of links that break due to modified Web pages. Support 
for providing links to/from parts of non-HTML data, such as sound and movies, is 
possible via interfaces to plug-ins and Java-based media players [4]. 

The hypermedia structures are stored in a hypermedia database, based on the 
Devise Hypermedia framework [4], and the service is available on the WWW via an 
ordinary Uniform Resource Locator (URL). The best user interface for creating and 
manipulating the structures is currently provided for the Microsoft Internet Explorer 
browser (see Fig. 1) through Object Linking and Embedding (OLE) [5] 
integration that utilises the Explorer's Document Object Model 
(http://www.w3.org/DOM/) representation of Web pages. The structures can also be 
manipulated and used via special Java applets, and a pure proxy server solution is 
provided for users who only need to browse the structures. A user can create and use 
the external structures as "transparency" layers (called contexts) on top of arbitrary 
Web pages; the user can switch between viewing pages with one or more layers 
(contexts) of structures or without any external structures imposed on them. Webvise 
clients communicate with servers using the most recent version of the so-called 
open hypermedia protocol (OHP) [9], which is being developed by the open 
hypermedia systems working group OHSWG (http://www.ohswg.org). 
Fig. 1. Screenshot of Webvise Client User Interface. 



1.3 Link Types and Digital Library Material 

Trigg [37] introduced the notion of link types in hypermedia. Link types were to add 
semantics to the relationships between nodes in a hypertext. In Trigg [37], link types 
were the semantic relationships that researchers were creating when reviewing and 
discussing electronic scientific literature. In successor systems, such as NoteCards 
[17] and MacWeb [27], link types have been used for structuring many other kinds 
of information. 

An important aspect of introducing link types is that it supports the author of 
hypermedia links in expressing his/her intention with a specific relationship, but 
equally important, it can support the reader in filtering and sorting the link 
relationships according to the link types assigned. Following the original ideas of link 
types in hypermedia, they have considerable potential in the area of digital libraries, which 
should support users not only in finding material of interest but also in continuing to use 
the found material for individual or group purposes. Here a hypermedia system 
supporting link types will support the digital library user in dynamically organising, 
commenting on and discussing material with colleagues. This paper demonstrates an 
approach to constructing an open hypermedia system supporting link types that is generally 
available to users who access digital libraries through the WWW. 
1.4 Structure of the Paper 

The structure of this paper is as follows: In Section 2, we discuss link types and their 
usage in hypermedia and digital libraries. Sections 3 and 4 discuss requirements for 
the implementation of typed links, and present a prototype, called WebviseLT in this 
paper, based on the Webvise open hypermedia system. Then, Section 5 discusses the 
use of WebviseLT in a digital library use scenario. Section 6 discusses future work, 
and Section 7 concludes the paper. 



2 The Notion of Link Types in Hypermedia 

This section introduces the notion of link types in hypermedia in general and relates it 
to other relevant areas of research. 

Link types, or more generally types for, or knowledge of, objects in hypermedia 
structures, have been proposed as a means for reifying manageable conceptual models 
in hypermedia structures [27]. Having typed links may thus be useful in a number of 
ways, including that link types: 

• add (user-defined) semantics to hypermedia structures, 

• reduce user disorientation, and 

• facilitate machine supported analysis and synthesis of hypermedia structures. 

Links in a hypermedia structure provide structure between nodes, allowing users to 
navigate. A binary link can be unidirectional, only allowing the user to jump from a 
node A to a destination B, but some systems provide bi-directional links which can 
be traversed in both directions. In general a link may have several destinations and the 
link itself may be traversed from any one to any other of these. Thus we can speak of 
the physical direction of a link, expressing the navigational (and in some sense temporal) 
relationship between nodes, telling the reader how to navigate through the 
hypermedia structure. 

With untyped links, regardless of which of the previous forms they take, it is left to 
the anchor context to explain the purpose of the link, i.e. why it was created, and what 
the reader may expect to find at the other end of the link. Considering links between 
two nodes, types generally allow for a semantic interpretation of the relationship 
between the nodes, e.g. ’A comments on B’, thereby giving the possibility of 
expressing knowledge about the link as a property of the link itself rather than solely by the 
context in which it is placed. 

The type of the link describing a (binary) relationship between two nodes often 
implies a semantic direction as well as a physical. As pointed out by Trigg [37] these 
need not be the same. If a node B comments on a node A, the author of a context 
might want the reader to follow a link to a node B only after having read node A. The 
intended physical direction of the link would then be from A to B, whereas the 
semantic direction would be from B to A, expressing that "B is a comment on A". The 
distinction between physical and semantic direction can cause problems interpreting 
the relationship between nodes when the two directions differ. In the example above 
we have a physical direction from A to B. However, the 
’Comments on’ link type has a semantic direction from B to A. In systems allowing 
only unidirectional links, in which the physical direction of links does not change, the 
problem can be solved by the author making sure that the physical and semantic 
directions are the same when creating the context. The directionality problem 
increases if the system allows bi-directional links because the physical direction 
depends on which node is selected as source node, whereas the semantic direction is 
fixed. This may lead to a situation in which both nodes could be interpreted as e.g. a 
comment on the other. Ideally, in this case the system should offer a mechanism to 
reverse the semantic direction of the link, depending on the physical direction 
(especially if nodes are not typed in the same manner as links), thus providing the 
reader with valuable help in determining the relationship between them. 
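
One way to picture such a mechanism is the following Python sketch. It is purely illustrative and does not describe any of the systems discussed: a link stores its fixed semantic direction, and the reading presented to the user is reversed when the link is traversed against that direction. The class `TypedLink`, the `INVERSE` table and the inverse label are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical table giving, for each link type, its reading in reverse.
INVERSE = {"comments on": "is commented on by"}


@dataclass(frozen=True)
class TypedLink:
    source: str     # semantic source, e.g. the commenting node B
    target: str     # semantic target, e.g. the commented node A
    link_type: str  # reading from source to target

    def describe_from(self, node):
        """Describe the link as seen from the node the reader is on."""
        if node == self.source:
            return f"{self.source} {self.link_type} {self.target}"
        if node == self.target:
            return f"{self.target} {INVERSE[self.link_type]} {self.source}"
        raise ValueError(f"{node!r} is not an endpoint of this link")


LINK = TypedLink(source="B", target="A", link_type="comments on")
```

Whichever node the reader starts from, the description stays consistent with the fixed semantic direction, so neither node can be mistaken for a comment on the other.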

The Dexter Hypertext Reference Model [18] proposes general bi-directional and n- 
ary links. Supporting such links raises a number of new issues on directionality and 
types. N-ary, unidirectional links with only one source-node but several destinations 
behave much like binary links (if the relationship between the source and each of its 
destinations is the same). In this case, the author could provide a link from a node A 
to nodes B, C, and D commenting on A. However, if there is not the same 
relationship between all the nodes of the link, the type of the link is restricted to 
express the most general relationship. This relationship is a common supertype if the 
types form a hierarchy. N-ary, bidirectional links make a special case. The link then 
connects a set of nodes with no fixed set of source nodes, and so a type can only 
express the common relationship between all the nodes of the link. In this case the 
semantic direction does not matter, as the semantic interpretation will be the same no 
matter in which way the link is traversed. 
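
The idea of a common supertype can be made concrete with a small Python sketch. It assumes a single-inheritance type hierarchy stored as a mapping from each type to its supertype (None at the root); the hierarchy shown and the function names are hypothetical and are not taken from any of the cited systems.

```python
def ancestors(link_type, parent):
    """Chain from link_type up to the root, most specific first."""
    chain = []
    while link_type is not None:
        chain.append(link_type)
        link_type = parent.get(link_type)
    return chain


def common_supertype(types, parent):
    """Most specific type that is equal to or above every type in `types`."""
    types = list(types)
    shared = set(ancestors(types[0], parent))
    for t in types[1:]:
        shared &= set(ancestors(t, parent))
    for candidate in ancestors(types[0], parent):  # most specific first
        if candidate in shared:
            return candidate
    return None


# Hypothetical hierarchy of link types.
PARENT = {
    "Link": None,
    "Commentary": "Link",
    "Comment": "Commentary",
    "Critique": "Commentary",
    "Citation": "Link",
}
```

An n-ary link whose endpoints relate as Comment and Critique could then be typed Commentary, while one mixing Comment and Citation could only carry the most general type Link.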



2.1 Link Types in Classical Hypermedia Systems 

A taxonomy of link types for use in the hypertext system TextNet was presented by 
Trigg [37]. TextNet was designed to aid in creating and reviewing scientific papers. 
The taxonomy was accordingly fixed. The taxonomy consists of more than 80 types, 
mainly divided into ’Normal’ and ’Commentary’ link types. Examples of the link types 
belonging to the ’Normal’ category, intended for providing the connection between 
nodes in the text, include ’Citation’ and ’Argument’ (both being divided into further 
subtypes), as well as ’Formalization/Application’ (depending on the direction of the 
link). The types in the ’Commentary’ category are intended to connect statements 
about a node to the node, including ’Comment’ and types concerning style. Trigg notes 
that a user-extensible type system is conceivable, but argues that the taxonomy is 
extensive enough to cover any use of the system within the boundaries intended. 
Additionally, Trigg raises three practical issues in allowing dynamic type systems: 
explosion of link types, reader confusion and system confusion. 

Explosion of link types covers the situation in which the type system grows 
unmanageably as a result of modifications made by one or more users of the hypertext 
system. This is partially connected to the second consideration, that of reader 
confusion. If users are allowed to extend the type system incrementally, it is very 
likely that they will not fully understand the modifications made by others. This may 
mean that they will add new types or modify existing ones to cover what other 
users have already expressed, or misuse types for something they were not intended 
for. System confusion, the third consideration, has to do with the relationships 
between the hypertext system and the link types. Some systems may have a partial 
understanding of the semantics of the type systems. In this case, the users may be 
required to give information to the system concerning new types. 

It will however almost certainly be necessary to allow user-configurable type 
systems in general purpose hypermedia if one wishes to allow typed links. If the 
domain in which the system will be used can not be determined in advance, then 
neither can an appropriate type system. It is then up to the users of the system to 
handle explosion of link types and reader confusion. 

Several systems have a notion of typed links. JANUS [11] is an example of such a 
system in that it is a combination of an issue-based hypertext system supporting 
argumentation and a knowledge-based system. Issues in the underlying hypermedia 
system are connected to each other by relationships like ’is more general than’. 
NoteCards [17] is an example of a system allowing typed links with a semantic 
direction. A type is simply a label selected by the user to describe the relationship 
between the source and destination node. In this way, NoteCards allowed user-defined 
types. Consequently, it can in this respect be seen as an extension of the principles in 
TextNet, with a dynamic type system to allow for user-customisation with regard to 
specific usage. 

The type systems mentioned above can be extended from being ’merely’ typed for 
the purpose of semantic reading to being type systems which allow link types to have 
(user-defined) attributes as well as scripts attached to them. In [27], an example of 
an object oriented model, implemented in MacWeb, is given. In such a model a link 
type can define attributes as well as scripts, and it can inherit attributes and scripts 
from multiple supertypes, giving the hypermedia structure the potential of being used 
in a wide range of use scenarios. The type system is implemented and modified in the 
hypermedia structure along with the contents itself. 



2.2 Link Types on the WWW 

In the case of the WWW, HTML allows the marking of a region in a node as being a 
link to an entire node or into an anchor in a node. In HTML, links are unidirectional 
and support only one source and one destination anchor. They are also embedded in 
the documents they link. The most common use of a link in HTML is to fetch another 
document. However, the suggested HTML+ standard [32] made a proposal for the 
embedding of types into links. Types were to be embedded in links and the type of a 
link should be specified via the ’rel’ and ’rev’ attributes of a link. Some predefined 
types for processing of documents existed, while user-defined link types were also 
made possible. The current HTML standard proposal [21] contains a subset of the 
HTML+ specification on link types. 



2.3 Link Types in Open (Hypermedia) Systems 

Many open hypermedia systems have been implemented, e.g. Chimera [1], 
Microcosm [10], DHM [13], HyperWave [26], and MultiCard [33], but none of them 
supports a link type system coming close to that of TextNet or NoteCards. Microcosm 
supports different link behaviours for what are called specific links, generic links, local 
links, and retrieval links. However, these different link behaviours do not correspond 
to any semantic link types as described above. Some of the systems, e.g. DHM, 
support the addition of general user-defined attribute/value pairs to objects in 
hypermedia structures. This means that, in principle, a simple way to provide types 
exists: Users or implementers may use the attribute/value pairs to encode the type of 
a hypermedia object. However, no use has generally been made of this facility with 
respect to providing types. Thus, to our knowledge, no open hypermedia system has 
implemented thorough support for link types, including hierarchical type systems and 
editors and browsers for link types. The ’ComMentor’ architecture [34] provides 
means for annotating Web pages. This facility may be used, in a similar way to 
attribute/value pairs, to simulate types of links. 



2.4 Types in Programming Languages 

Since the 50’s, types have been an integral part of many programming languages [35]. 
In programming languages, a type is a set of entities having some properties in 
common. An example of a type is ’integer’: Integers may be added, multiplied, and so 
on, whereas integers cannot be compared to strings. Any entity belonging to the set of 
’integers’ is guaranteed at least to have the properties of integers. Thus, types in 
programming languages may be viewed as providing the user of an entity with a 
"contract" on the capabilities of that entity. Object-orientation [24] gives a highly 
structured view on types. Entities in an object-oriented program (’objects’) are seen as 
instances of concepts (’classes’). These concepts are the types of (typed) object- 
oriented programming languages. Just as concepts in the real world may be structured 
by means of classification and composition, object-oriented programming languages 
support classification and composition of types. Types may be classified in 
taxonomies, or classification hierarchies, and types may be used in defining other 
types through composition. The notion of (link) types for open hypermedia introduced 
in this paper is inspired by the object-oriented view on types. 



3 Designing and Implementing Typed Link Support for Digital 
Libraries 

In this section, we first discuss the design of a link type system for open hypermedia. 
Then we discuss the integration of open hypermedia in digital libraries. Finally, based 
on this discussion, we describe the implementation of WebviseLT, a prototype of an 
open hypermedia system extended with link types, based on Webvise. 



3.1 Design of Support for Typed Links 

As discussed, types can be seen as modelling concepts that cover relationships 
between components in the hypermedia domain that a concrete hypermedia structure 
is all about. Trigg [37] discusses scientific writing and mentions types such as 
comments and argument. Nanard et al. [27] discuss the use of types (in general) to 
represent knowledge and mention link types such as "Has Painted" or "Has Carved" 
for educational hypermedia structures. Ideally then, a design of typed links would 
incorporate means for at least classification and composition of data and actions. This 
disqualifies the simple types of e.g. NoteCards [17] (that are merely tags or labels on 
existing links). Also, if link types are to be used dynamically in daily use, several 
conditions must be met. First, it should be possible to have typed and non-typed links 
co-exist. Second, the type system should be so dynamic that it allows evolution while 
in use. The necessary dynamics may be achieved by, among other things, allowing users to 
add attributes to links without changing their type, to change the type system without 
"breaking" the types of existing links, and to change the type of links dynamically. 
Then, if a general-purpose open hypermedia system, like Webvise, is to implement 
link types, it certainly should not implement a fixed type system, but rather a dynamic 
one specific to individual contexts. 

We thus view types as evolving, external, and encompassing data and action: A 
link type has a set of data and actions, and link types can be specialised. As an 
example, in this conceptual view Trigg’s ’Argument’ link type is a specialisation of a 
’Normal’ link type. It might have an attribute ’Against’. The ’Normal’ link type might 
have an action. Any link with type ’Argument’ would then have an ’Against’ attribute 
and the action of the ’Normal’ link type. 
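
The conceptual view above can be illustrated with a small sketch. The Python below is purely illustrative and is not how WebviseLT is implemented; the class `LinkType`, its methods `all_attributes` and `all_actions`, the action name `on_follow`, and the function `log_follow` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LinkType:
    """A link type: a named set of attributes (data) and actions."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    actions: Dict[str, Callable[[dict], None]] = field(default_factory=dict)
    supertypes: List["LinkType"] = field(default_factory=list)

    def all_attributes(self) -> Dict[str, str]:
        """Attributes of this type plus those inherited from supertypes."""
        merged: Dict[str, str] = {}
        for parent in self.supertypes:
            merged.update(parent.all_attributes())
        merged.update(self.attributes)  # own definitions override inherited ones
        return merged

    def all_actions(self) -> Dict[str, Callable[[dict], None]]:
        """Actions of this type plus those inherited from supertypes."""
        merged: Dict[str, Callable[[dict], None]] = {}
        for parent in self.supertypes:
            merged.update(parent.all_actions())
        merged.update(self.actions)
        return merged


def log_follow(link: dict) -> None:
    """Hypothetical action: count how often a link has been followed."""
    link["followed"] = link.get("followed", 0) + 1


# 'Normal' carries an action; 'Argument' specialises it and adds 'Against'.
normal = LinkType("Normal", actions={"on_follow": log_follow})
argument = LinkType("Argument", attributes={"Against": ""}, supertypes=[normal])
```

Any link typed as `argument` would thus see both the ’Against’ attribute and the action defined on ’Normal’, while links typed as `normal` would not have ’Against’.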



3.2 Integrating Open Hypermedia in a Digital Library Architecture 

Since our approach is to introduce typed links in digital libraries by means of an open 
hypermedia system, we need to describe how open hypermedia may fit in as a value 
adding service in arbitrary digital libraries accessible via the WWW. Open 
hypermedia systems as well as digital library systems may be roughly described as 
three-tier architectures as shown in Fig. 2 in a slightly modified version of Bass et 
al.’s [2] software architecture notation. 

The digital library material is accessed via a WWW portal in the architecture 
above. The portal may integrate a number of library sources through a common 
Webserver and metadata gateway. The open hypermedia system can be integrated in 
an orthogonal manner and the hypermedia structures can be stored externally to the 
library sources. The open hypermedia service may be provided on the same server as 
the digital library portal server or on a separate server. The only assumptions are that 
the Web browser in question can communicate with the open hypermedia client 
(Webvise in this case) and that the addresses (URLs), provided by the WWW portal 
server, are stable. 



3.3 WebviseLT Architecture and Implementation 

Extending the storage layer of Webvise to implement link types would seem ideal: 
Webvise is written in the object-oriented language BETA [24], and the structure of 
our link types coincides with the structure of object-oriented types. This is, however, 
not a very good idea for several reasons. First, the structure of link types is most 
profitably created in situ, and BETA - or any other typed, class-based language - is 
currently not capable of supporting this in the generality that is demanded. Second, 
there is a general trend in (open) hypermedia towards standardisation 
(http://www.ohswg.org), meaning that directly extending the storage layer of 
Webvise, implementing link types and sub-types of the ’link’ class in Webvise, would 
be problematic since it almost certainly would have to change. Third, types may best 
be viewed as external to hypermedia structures. In this way types may have an 
existence across different hypermedia (and hypermedia systems). 

The implementation of WebviseLT is based on the current version of Webvise and 
runs on the Windows NT/9x platform. Fig. 3 shows how the architecture of 
WebviseLT fits into the general open hypermedia architecture model described above. 




Fig. 3. Current Architecture of WebviseLT 



The WebviseLT client is shown as a process communicating via TCP/IP with the 
Webvise server. The Webvise server implements a standard, layered open hypermedia 
system architecture. The WebviseLT client communicates with external applications 
such as Microsoft Word and Microsoft Internet Explorer using OLE communication. 

Only subcomponents of the WebviseLT client that concern the implementation of 
link types are shown. Of these, the ’Link Type Facade’ component is central for the 
extension of Webvise. The instantiation and associated use of the component is an 
application of the ’Domain Model Concealer’ architectural pattern [19]. It handles all 
saving, loading, and interpretation of link types on the client side by effectively 
transforming its interface into calls to the server. This means that the OHSWG 
standard communication protocol used by Webvise has been slightly extended to cater 
for link types. Since the facade is the sole point of handling of link types in the client, 
changes to the representation of link types are simple to make. This was seen as a 
necessity since the server component of the Webvise system was actually changed to 
conform to the new standards of the OHSWG. 



Scripting. A vital part of semantics is action. We make a preliminary attempt at 
providing this for link types by incorporating scripts written in TCL/TK [31] into 
links. Since attributes with the name ’Script’ are interpreted as TCL/TK scripts, scripts 
are edited in the client in the same way that attributes are edited. An example of this is 
shown in Fig. 5, below. The script maintains a usage count (’noOfUsages’) for the link 
and a date at which the link was last followed (’lastUsed’) for any link with a type that 
is ’Link Type’ or a specialisation thereof. 

The current simple implementation executes the scripts on the client, and only 
when following links or presenting anchors, but more elaborate tailoring schemes are 
imaginable and desirable. 



Representing Link Types. The directed acyclic graph of link types is saved in XML 
format [39] as an attribute on a context object. An example of this is shown in Fig. 4. 

<linktypes> 
  <linktype> 
    <typeid>0</typeid> 
    <typename>Link Type</typename> 
    <typeattributes> 
      <attrname>noOfUsages</attrname> 
      <attrvalue>0</attrvalue> 
    </typeattributes> 
    <supertypes></supertypes> 
  </linktype> 
  <linktype> 
    <typeid>1</typeid> 
    <typename>Quality of Source</typename> 
    <typeattributes> 
      <attrname>Why</attrname> 
      <attrvalue></attrvalue> 
    </typeattributes> 
    <supertypes> 
      <typeid>0</typeid> 
    </supertypes> 
  </linktype> 
</linktypes> 



Fig. 4. Example of XML Format of a Link Type System. 

The example defines a link type system with two types, ’Link Type’ and ’Quality of 
Source’. ’Link Type’ has a ’noOfUsages’ attribute with the default value ’0’. ’Quality of 
Source’ also has a ’noOfUsages’ attribute with the default value ’0’, since it is a 
subtype of ’Link Type’. Moreover, ’Quality of Source’ has an attribute ’Why’ with no 
default value. If a link has a type, the type is saved as an attribute on the link. This 
attribute has an id into the link type system as value. If the link does not have a type, 
such an attribute does not exist on the link. 
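
To illustrate how such a representation might be read, the following sketch parses the format of Fig. 4 with Python's standard `xml.etree.ElementTree` module and resolves the attributes a type inherits from its supertypes. It is an assumption-laden illustration, not the WebviseLT code: the function names `load_link_types` and `effective_attributes` and the dictionary layout are hypothetical.

```python
import xml.etree.ElementTree as ET

# The link type system of Fig. 4, embedded as a string for the example.
XML = """
<linktypes>
  <linktype>
    <typeid>0</typeid>
    <typename>Link Type</typename>
    <typeattributes>
      <attrname>noOfUsages</attrname>
      <attrvalue>0</attrvalue>
    </typeattributes>
    <supertypes></supertypes>
  </linktype>
  <linktype>
    <typeid>1</typeid>
    <typename>Quality of Source</typename>
    <typeattributes>
      <attrname>Why</attrname>
      <attrvalue></attrvalue>
    </typeattributes>
    <supertypes><typeid>0</typeid></supertypes>
  </linktype>
</linktypes>
"""


def load_link_types(xml_text):
    """Return {typeid: {"name", "attrs", "supers"}} for each <linktype>."""
    types = {}
    for node in ET.fromstring(xml_text).findall("linktype"):
        attrs_node = node.find("typeattributes")
        names = [n.text or "" for n in attrs_node.findall("attrname")]
        values = [v.text or "" for v in attrs_node.findall("attrvalue")]
        types[node.findtext("typeid")] = {
            "name": node.findtext("typename"),
            "attrs": dict(zip(names, values)),  # pair names with defaults
            "supers": [s.text for s in node.find("supertypes").findall("typeid")],
        }
    return types


def effective_attributes(types, type_id):
    """Own attributes plus the defaults inherited from all supertypes."""
    result = {}
    for super_id in types[type_id]["supers"]:
        result.update(effective_attributes(types, super_id))
    result.update(types[type_id]["attrs"])
    return result


link_types = load_link_types(XML)
quality = effective_attributes(link_types, "1")  # 'Quality of Source'
```

Here `quality` contains both ’Why’ and the inherited ’noOfUsages’, matching the description of the example above.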

By using XML format, the link type system can be represented and stored 
externally to contexts. This means that it becomes possible to exchange link type 
systems, not only between contexts, but also between different hypermedia systems. 
Moreover, combination of link type systems may also be facilitated. This may potentially 
lead to conflicts with respect to e.g. naming. See section 6.2 for a discussion of this. 



Integration with Other Applications. In order to use the typing mechanism 
efficiently in actual work, the functionality of WebviseLT needs to be integrated with 
commonly used applications such as Microsoft Word or Internet Explorer. This 
means, among other things, that external applications need to be able to visualise types of 
links. As an example, the OLE communication from Webvise to Microsoft Internet 
Explorer was changed to pass on the name of the type for the link being presented. 
Currently, the name of the type is simply inserted as a ’title’ on links [29]. This creates 
a tool-tip with the name of the link type when the mouse is placed over the link. In 
this way (part of) the knowledge incorporated in the link type is passed on to the user 
immediately. 

If the link is not explicitly typed no type description will appear. Later we will 
touch upon what type of information to give the user when presenting links. 
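
The effect of the ’title’ approach can be sketched as follows. This is only an illustration of the resulting HTML, under the assumption that the type name ends up as a `title` attribute on the anchor element; the function `render_link` and the example URL are hypothetical, and the actual system passes the type name to Internet Explorer via OLE rather than generating markup in this way.

```python
from html import escape
from typing import Optional


def render_link(href: str, text: str, type_name: Optional[str] = None) -> str:
    """Render an HTML anchor; typed links carry the type name as 'title'."""
    # Untyped links get no title attribute, so no tool-tip appears.
    title = f' title="{escape(type_name, quote=True)}"' if type_name else ""
    return f'<a href="{escape(href, quote=True)}"{title}>{escape(text)}</a>'


typed = render_link("http://example.org/paper.html", "A paper", "Quality of Source")
untyped = render_link("http://example.org/paper.html", "A paper")
```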



4 Dynamic Use of Link Types in WebviseLT 

In this section, we discuss aspects of the creation and use of link types based on 
WebviseLT. 



4.1 Tailoring the Type System 

In WebviseLT, editing of link types is done in a graphical link type editor presenting 
the user with a view of the type system (Fig. 5). Using the editor, a user may freely 
add and delete types as well as add and remove specialisations between types. For any 
type, a number of attributes and their default values may be specified. In Fig. 5 it is 
e.g. indicated that the basic link type (’Link Type’) has five attributes (’Type name’, 
’Owner’, ’LastUsed’, ’noOfUsages’, and ’Script’). This means that any link that has the 
type ’Link Type’ (or a specialisation hereof) is guaranteed to have at least these five 
attributes. Furthermore, such links will have a ’Script’ attribute with the default value 
shown in the window on top. 

It is possible to change the type hierarchy after having assigned link types to links. 
This introduces the difficult problem of schema evolution and database reorganisation 
[23]: Since a type for a link may be viewed as guaranteeing a specific interface of a 
link, any editing of the link type hierarchy must preserve this property. In the current 
implementation any addition to the link type hierarchy that does not affect existing 
typed links is unproblematic. An example of such a change would be to add an 
’Intermediate’ type as a specialisation of the ’Required Knowledge’ type. However, a 
change to the type of an existing typed link will affect that link. If the type is 
removed, the link will be untyped. If the type is changed, the link will have the type 
reapplied. Thus, the interface property is preserved. 

set today [clock format [clock seconds] -format %Y%m%d] 
set uses [getAttr noOfUsages] 
setAttr noOfUsages [incr uses] 
setAttr LastUsed $today 






Fig. 5. Link Types Editor. 



4.2 Dynamic Assignment of Types 

Evidently the user should have a way to specify and inspect the type of a given link. 
This is accomplished through the tree-based type selection dialog. With a link of a 
given type the list will open with that type marked (Fig. 6). If the link is not typed, no 
element of the list will be marked. 



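[Screenshot of the type selection dialog: a tree rooted at ’Link Type’ listing the 
types ’Required Knowledge’ (’Introductory’, ’Experienced’, ’Advanced’), ’Quality of 
Source’ (’Recommend as Primary Reading’, ’Useless’, ’Recommend as Background 
Reading’), and ’Referential’ (’Elaborates’, ’Exemplifies’, ’Explains’, ’Contradicts’), 
with OK and Cancel buttons.] 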




Fig. 6. Available Link Types Viewer and Selector. 

The most general type will be the root of the tree, and the subtypes of a type will be 
shown as child nodes of that type’s node. In the case of multiple inheritance, a 
type will appear under each of its supertypes. This may potentially lead to 
exponentially large trees, but it certainly gives the most accurate description of the 
relation between types. Upon accepting the dialogue, the selected link in the 
WebviseLT client is assigned either the chosen type or no type if no type was chosen. 
After the assignment of a type to a link, the user may freely add and remove attributes 
of the link not connected to the assigned link type (not shown). One might also allow 
the user to choose more than one type from the dialogue, thus allowing multiple 
inheritance for a single link to be declared on the fly. 
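
The unfolding of a type graph into such a tree can be sketched as below. The sketch assumes a small type system modelled on the scenario in Section 5, stored as a mapping from each type to its supertypes; the names `SUPERTYPES`, `subtypes_of`, and `unfold` are hypothetical and do not reflect the WebviseLT implementation.

```python
from typing import Dict, List

# Hypothetical type system: each type maps to its list of supertypes (a DAG).
SUPERTYPES: Dict[str, List[str]] = {
    "Link Type": [],
    "Required Knowledge": ["Link Type"],
    "Quality of Source": ["Link Type"],
    "Advanced": ["Required Knowledge"],
    "Recommend as Primary Reading": ["Quality of Source"],
    "AdvancedPrimary": ["Advanced", "Recommend as Primary Reading"],
}


def subtypes_of(name: str) -> List[str]:
    """Direct subtypes of a type, in declaration order."""
    return [child for child, supers in SUPERTYPES.items() if name in supers]


def unfold(name: str, depth: int = 0) -> List[str]:
    """Unfold the DAG below `name` into indented tree rows.

    A type with several supertypes is repeated under each of them,
    which is why the tree can grow much larger than the graph.
    """
    rows = ["  " * depth + name]
    for child in subtypes_of(name):
        rows.extend(unfold(child, depth + 1))
    return rows


tree = unfold("Link Type")
```

In this example ’AdvancedPrimary’ appears twice in `tree`, once under ’Advanced’ and once under ’Recommend as Primary Reading’.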



4.3 Querying Contexts With Typed Links 

A general query framework for contexts has been examined. This has been done using 
a slightly changed and simplified version of the Object-oriented Query Language 
(OQL) [7]. This general query mechanism has been used to implement a simple query 
mechanism for searching contexts in WebviseLT. 

The user interface is rather simple (Fig. 7). It allows for searches on hypermedia 
components such as links and anchors using a list of simple conjunctions or 
disjunctions. In Fig. 7, a context named ’Hypermedia Course 99’ is searched for link 
components with type ’AdvancedPrimary’ that have a usage count greater than 5. 




Fig. 7. The Search Dialog. Searching for Links with Type ’AdvancedPrimary’ that 
have been followed at least five times. 

By pressing "More", the user may include more search criteria. Pressing "Less" 
would mean that the search for "Attribute Value" that "is greater than" would be 
excluded from the search. This means that the search will be for links with type 
’AdvancedPrimary’ that have an attribute named "noOfUsages". The result of the 
current search is, as shown in the results canvas, four elements representing links. If 
such an element is double-clicked, the first node to which the link belongs will be 
presented in the node browser of the WebviseLT client and the link itself will be 
presented in the link browser of the WebviseLT client. In order to communicate the 
query, it is translated to an equivalent (simplified) OQL query and executed on the 
server. 
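
As a rough illustration of such a translation, the sketch below turns a list of dialog criteria into an OQL-style `select` statement. The exact query dialect used by WebviseLT is not given here, so the collection name `Links`, the field names, the operator labels, and the function `to_oql` are all assumptions made for the example.

```python
from typing import List, Tuple

Criterion = Tuple[str, str, str]  # (field name, operator label, value)

# Assumed mapping from dialog operator labels to query operators.
OPERATORS = {"is": "=", "is greater than": ">", "is less than": "<"}


def to_oql(component: str, criteria: List[Criterion], combine: str = "and") -> str:
    """Build an OQL-style select statement from dialog criteria.

    Values are assumed not to contain double quotes; no escaping is done.
    """
    if combine not in ("and", "or"):
        raise ValueError("combine must be 'and' or 'or'")
    clauses = []
    for name, op, value in criteria:
        # Numbers are emitted bare, everything else as a quoted string.
        literal = value if value.isdigit() else f'"{value}"'
        clauses.append(f"c.{name} {OPERATORS[op]} {literal}")
    query = f"select c from {component} c"
    if clauses:
        query += " where " + f" {combine} ".join(clauses)
    return query


criteria = [("type", "is", "AdvancedPrimary"),
            ("noOfUsages", "is greater than", "5")]
query = to_oql("Links", criteria)
```

Dropping the last criterion (`criteria[:-1]`) yields a query on the type alone, and passing `combine="or"` yields a disjunction instead of a conjunction.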



5 Applying WebviseLT in Digital Libraries: A Use Scenario 

This section describes a typical use of digital libraries in the planning of a new course 
based on digital library material. Open hypermedia can be used to organise digital 
library material for the specific purpose of a user. Organisation may consist in 
creation of collections, hierarchies, linked paths, and annotations. The notion of 
semantic types further enriches the means for organising material. 



5.1 The Setting 

Christiana is teaching at a school of library science and Miguel teaches at a computer 
science department. Together they are planning a new course in hypermedia that will 
have participants from both sites. Since their daily workplaces are in different 
locations, they decide to do the planning over the Web supplemented with phone and 
email conversations. They have access to both international and national libraries as 
well as publishers through a common Web portal. They also have access to the 
WebviseLT system, which enables them to share their planning documents and their 
collaborative structuring of the digital library material they find via the portal. Both 
Christiana and Miguel are creating documents on their desktops containing outline 
proposals of the course, lists of proposed readings, etc. The central documents and the 
type system they create are illustrated in Figs. 5-7, with a temporary plan as the 
central document with typed links to the different sources. This plan is placed on a 
server that both have access to. 



5.2 Organising Material 

When planning the specific sections of the course, the teachers issue a number of 
queries for relevant digital library materials. The results are organised with the 
WebviseLT system supporting link types. The dynamic planning documents are 
managed in MS Word extended with open hypermedia, and the library sources are 
accessed via an open hypermedia augmented Internet Explorer. A temporary plan for 
the topics of the course is made in Microsoft Word after phone-based discussions. 

The two teachers start introducing link types to distinguish between the links they 
create to the digital library sources. They develop their own conventions about 
semantic link types similar to the types introduced by Trigg [37] for reviewing 
scientific literature. For instance, they introduce link types such as ’Introductory’, 
’Experienced’, and ’Advanced’ to distinguish sources based on the knowledge required 
to read them (Fig. 5). To facilitate the discussion of the quality of the sources, they 
introduce link types such as ’Recommend as Primary Reading’, ’Recommend as 
Background Reading’, and ’Useless’. As they are collecting source material, they 
create anchors for links in the planning document and links to the relevant sources. 
Link types are then assigned to the links (Fig. 7) and used to communicate why the 
actual link to a document is included. As they continue to organise the material, links 
with multiple types are needed. To express, e.g., that a text that is ’Recommended as 
Primary Reading’ requires a knowledge level of ’Advanced’ in order to read it, 
Christiana creates a new link type ’AdvancedPrimary’ that has ’Advanced’ and 
’Recommend as Primary Reading’ as supertypes and assigns this type to the link. New 
link types are introduced as needed. 

Both of the above sets of link types are meant to express assessments about the 
destination node as a whole. In addition, a number of types are used to provide meaning to 
the referential links between anchored segments of the material. These types are 
typically called ’Elaborates’, ’Exemplifies’, ’Contradicts’, ’Explains’, etc. Miguel may 
also decide to prepare the hypermedia structures for monitoring the students’ use of the 
material. To facilitate simple monitoring of the students’ use of the context they 
create, Miguel adds attributes to all link types (’noOfUsages’, ’lastUsed’) and creates 
scripts to process the use of the context (Fig. 6). 



5.3 Use of Material 

The teachers use the type mechanisms to generate filtered browsers for their students, 
based on their own collection of resources. They decide to provide their students with 
filtered browsers for resources of the types ’Recommended as Primary reading’ and 
’Recommended as Background reading’ for each week as well as the full length of the 
course. Their use of attributes and scripts enables them to get a picture of the usage of 
the online material. If, e.g., they want to know how many link followings there have 
been to advanced material they have found "recommended as primary reading", they 
may perform a search as shown in Fig. 7. 

As the contexts given to the students are identical to the teachers’, except that 
useless resources are filtered out, the students have access to the referential links 
between papers as well as the notes made by the teachers. To some of their students, 
this is a nice feature. Although they may prefer to read the articles on paper rather 
than the screen, they will go through next week’s context in WebviseLT as part of 
their preparation. Doing this, they may find that the referential links and the 
occasional notes help in reading the papers. And as the course progresses, they may 
lift parts of the resources into their own contexts in order to make some of their own 
thoughts of the papers and their relationship persistent. 



6 Status and Future Work 

This section first gives a brief status of the WebviseLT prototype and then 
briefly discusses a number of possible extensions to WebviseLT and to the notion of types 
in open hypermedia in general. The approach to adding typed links to open 
hypermedia has been implemented in WebviseLT as an extension to Webvise. The 
current version of WebviseLT may be downloaded from 
http://www.daimi.au.dk/~marius/hypermedia/webvise/. 



6.1 Typed Components 

In this paper, we have primarily focussed on typed links. The above discussion about 
type systems for links generally applies to type systems for any kind of component in 
a hypermedia system [28]. In the object-oriented model of MacWeb, e.g., types for 
nodes can be handled in much the same way as types for links. While the type of links 
helps to identify the relationship between the source and the target(s) of the link, the 
type of nodes may help identify the classification of a node. This is probably even 
more so if the type system for links is ’weak’, meaning a link of a given type may lead 
to more than one type of node. Such an extension would be fairly easy in WebviseLT, 
since the mechanisms used to handle attributes (of which the type itself is one) for 
links would work for all types of components. 

In the same manner, WebviseLT could easily be extended to allow typing of its 
built-in pop-up notes. WebviseLT supports insertion and subsequent sharing of notes 
at arbitrary points of a Web page. When selected, the note will be displayed in a dialog 
box. Typing of notes would, as is the case with links, provide the reader with 
additional information about the contents of the note, prior to following it, making it 
possible for the reader to avoid opening notes that are not of interest. 



6.2 Combination of Contexts 

We tend to believe that dynamic type systems will generally evolve into relatively 
closed type systems, which only rarely need to be modified. If this is true, then we 
should certainly cater for users who create new contexts every once in a while. 
We took the design decision of placing the type system in the context, rather than 
globally in the system, but it ought to be possible to transfer or share type systems 
between contexts. This poses some problems, however. What should be done if a type 
system is imported into a context with an existing type system? Should it simply 
replace the old type system, or should the system try to merge them? This again 
introduces the problem of schema evolution that could be handled more generally 
than in the current implementation. The user might for example use domain 
knowledge to specify a graceful merge of type systems. 

Another issue is importing one context into another. If the components of the 
imported context have been typed, how should this be handled? In particular, clashes 
of type names require human resolution. 
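
One conservative way to approach such a merge is sketched below: types that exist only in the imported system are added, identical definitions are accepted silently, and names that clash with a different definition are reported for the user to resolve. This is a simplified assumption for illustration; it ignores supertypes and scripts, and the names `TypeSystem` and `merge_type_systems` are hypothetical.

```python
from typing import Dict, List, Tuple

TypeSystem = Dict[str, Dict[str, str]]  # type name -> attribute defaults


def merge_type_systems(existing: TypeSystem,
                       imported: TypeSystem) -> Tuple[TypeSystem, List[str]]:
    """Merge two type systems; report clashing names instead of guessing."""
    merged = {name: dict(attrs) for name, attrs in existing.items()}
    clashes = []
    for name, attrs in imported.items():
        if name not in merged:
            merged[name] = dict(attrs)   # new type: simply add it
        elif merged[name] != attrs:
            clashes.append(name)         # same name, different definition
    return merged, sorted(clashes)


existing = {"Link Type": {"noOfUsages": "0"}, "Advanced": {}}
imported = {"Link Type": {"noOfUsages": "0"},
            "Advanced": {"Level": "3"},
            "Useless": {}}
merged, clashes = merge_type_systems(existing, imported)
```

In this example ’Useless’ is added, ’Link Type’ is accepted as identical, and ’Advanced’ is reported as a clash while the existing definition is kept.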



6.3 Presenting Navigational Information 

As previously discussed, it is necessary to visualise link types in external 
applications. When Webvise is further extended to allow other kinds of typed 
hypermedia components, this knowledge has to be displayed as well. Yet another possible 
extension would be to show the document type of the target, i.e. the application which 
will present the target node. This can only be done if there is only one possible target 
for a link, but it could be used in a list of possible targets as well. The benefit of 
presenting one or more of the above kinds of navigational information would be added 
predictability when navigating the context. Possibly the user could avoid following 
links which lead to targets of no particular interest in the current use situation. 
However, it would also mean an added overhead for presenting nodes, if all links in 
the node had to be resolved in advance to determine the attributes of the target(s). 
This is not a problem when confining the system to show only the type of the link, 
since this information is accessible anyway. 



6.4 Typing Types 

Although our implementation supports an object-oriented type description, the actual 
types are not stored in such a way at all. To gain the full strength of this approach we 
still have to consider a number of enhancements. Especially important is typing of 
attributes of types: Currently all attributes are considered as being of (data) type 
’string’. An extension could also support composition and association between types, 
supporting the structuring mechanisms from object-oriented programming languages 
in full generality. Also, when the standardised server is in place, real, strongly typed 
links via dynamic linking or interpretation could profitably be considered. This will 
also include extended and general scripting support. 



7 Conclusions 

This paper has introduced an approach to support digital library users in organising 
and annotating material found in digital libraries. This approach is based on open 
hypermedia. It has taken the open hypermedia approach a step further by introducing 
support for typed links in open hypermedia and in turn demonstrated the potentials for 
digital library users. 

Link types in open hypermedia support: addition of (user-defined) semantics to 
hypermedia structures, reduction of user disorientation, and machine supported 
analysis and synthesis of hypermedia structures. A type specifies a set of attributes 
and methods that are guaranteed to exist for links of that type, thus giving semantics to 
links. Moreover, supporting users in defining their own types with corresponding 
semantics for links is important in the context of emerging structures. The Webvise 
open hypermedia system, which is tightly integrated with the MS Internet Explorer, 
has been augmented with a type system according to this approach. Finally, the 
potential use of WebviseLT in the context of digital libraries has been illustrated with 
a scenario of teachers jointly preparing a course based on digital library material. 



Abstract. This paper gives an overview of tools and methods for Cross-Language 
Information Retrieval (CLIR) that are developed within the 
Twenty-One project. The tools and methods are evaluated with the 
TREC CLIR task document collection using Dutch queries on the English 
document base. The main issue addressed here is an evaluation of 
two approaches to disambiguation. The underlying question is whether a 
lot of effort should be put into finding the correct translation for each query 
term before searching, or whether searching with more than one possible 
translation leads to better results. The experimental study suggests that 
the quality of search methods is more important than the quality of 
disambiguation methods. Good retrieval methods are able to disambiguate 
translated queries implicitly during searching. 

Keywords: Cross-Language Information Retrieval, Statistical Machine 
Translation. 



1 Introduction 

Within the project Twenty-One a system is built that supports Cross-Language 
Information Retrieval (CLIR). Cross-language retrieval supports 
the users of multilingual document collections by allowing them to submit 
queries in one language, and retrieve documents in any of the languages 
covered by the retrieval system. For the example of Dutch queries on an 
English document collection, this can be achieved by off-line document 
translation (translating English documents into Dutch, then indexing in 
Dutch), off-line index translation (indexing English documents in English, 
then translating the resulting index into Dutch), or on-line query translation 
(indexing English documents in English and translating Dutch queries on 
the fly into English). The latter method is preferred when the former two 
are impractical. Query translation is enforced in environments where it 
would be impossible to produce translations for all documents in the 
document base and/or to produce translated indices for each language. 
Document translation has the major advantage that it is possible to present 
the user a high quality preview of the retrieved documents. Translating 
documents after they are retrieved, as offered by some web search engines, 
does not suffice because it does not help the users to identify material that 
they might want to have translated. Since it presupposes that the user 
has already found the relevant document in its original foreign language, 
it fails to support exactly that part of a search in a multilingual environ- 
ment which is the most difficult one: to formulate a query which will then 
take the user to the foreign language document of interest. 

The Twenty-One project has a clear target domain. It focuses on dis- 
closing literature on sustainable development in four languages: Dutch, 
English, French and German. The project also has a strong focus on the 
disclosure of paper documents which have to be scanned and converted to
an electronic format by optical character recognition software. A third
focus is on natural language processing in the four supported languages.
At indexing time noun phrases are recognised and used as complex index
terms. As the Twenty-One domain is limited and as heavy preprocessing
and storage of scanned documents has to be reckoned with anyhow, this
is a classic case for the document translation approach. The Twenty-One
system^1 uses sophisticated translation software for the disambiguation of
index terms in context. If a word has more than one possible translation 
it is called ambiguous, e.g. the English word plant has two possible French 
translations plante for the sense of ’vegetation’ and usine for the sense 
of ’factory’. The term disambiguation is used in two ways in this paper. 
Disambiguation refers to the process of choosing the best translation. 
However, disambiguation might also refer to the estimation of probabil- 
ities for each possible translation. The disambiguation process might for 
instance assign a probability of 0.8 to plante and 0.2 to usine. The prob- 
abilities can be used to identify the most probable translation, but, if 
the query translation approach is taken, they might also be used during 
retrieval to weight each possible translation. Currently disambiguation in 
Twenty-One can be pursued in four ways: 

- using existing machine translation software (LOGOS)

- selection of the preferred translation from a machine readable dictionary
  (Van Dale)

- the use of domain specific dictionaries that are automatically generated
  on the basis of statistically processed parallel corpora (suited for
  specific applications only)

- disambiguation on the basis of the frequency of noun phrases in the
  document collection

^1 Twenty-One was the first on-line text retrieval system supporting CLIR in Europe:
http://twentyone.tpd.tno.nl/21demomooi/
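The two senses of disambiguation distinguished above (choosing the best translation versus estimating a probability for every translation) can be illustrated with a minimal Python sketch. The probabilities are the illustrative 0.8 / 0.2 values from the text; the function names are hypothetical and not part of the Twenty-One system.

```python
# Illustrative translation probabilities for the English word "plant",
# taken from the example in the text (0.8 for plante, 0.2 for usine).
translation_probs = {"plante": 0.8, "usine": 0.2}


def best_translation(probs):
    """Disambiguation in the first sense: pick the most probable translation."""
    return max(probs, key=probs.get)


def weighted_translations(probs):
    """Disambiguation in the second sense: keep every translation,
    with weights rescaled so that they sum to one."""
    total = sum(probs.values())
    return {translation: p / total for translation, p in probs.items()}


print(best_translation(translation_probs))
print(weighted_translations(translation_probs))
```

The first function corresponds to what document translation must do; the second keeps the information that query translation can use to weight each candidate during retrieval.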

In this paper the different disambiguation strategies of the Twenty-One 
system will be evaluated. The paper addresses the question which strategy 
results in the best retrieval performance, but it also addresses the question 
if disambiguation is necessary at all. New probabilistic retrieval techniques 
that are developed at the University of Twente are able to disambiguate 
translated queries implicitly during searching. 

This paper is organised as follows. Section 2 explores possibilities for
the comparison of the "document translation" approach with the "query
translation" approach. Section 3 introduces three basic methods for query
translation. Section 4 addresses heuristics and statistics for disambiguation
if the query translation approach to cross-language retrieval is followed.
Section 5 discusses the setup of our experiments and experimental
results. Finally, section 6 contains concluding remarks. Technical details 
of the probabilistic retrieval model can be found in the appendix of this 
paper. 

2 Empirically Comparing Document Translation With 
Query Translation 

In the introduction three important advantages of document translation 
were mentioned. Firstly, it can be done off-line. Secondly, if a classical 
machine translation is used, it is possible to present the user a high qual- 
ity preview of a document. Thirdly, there is more context available for 
lexical disambiguation which might lead to better retrieval performance 
in terms of precision and recall.^2 For several types of applications, the
first and second advantages may be good reasons to choose document
translation. The third advantage seems quite plausible and was hypothesised
in a number of early publications on cross-language retrieval, e.g.
Oard and Dorr [16], Hull and Grefenstette [10] and Kraaij [11].

Does the document translation approach to cross-language retrieval
using classical machine translation really lead to better retrieval performance
than the query translation approach using a machine readable
dictionary? A recent experimental study by Oard [15] suggests it does.
However, for a number of reasons it is very difficult to answer this question
on the basis of empirical evidence. A first problem is that in the query
translation approach, searching is done in the language of the documents
while in the document translation approach searching is done in the language
of the query. But it is a well known fact that information retrieval
is not equally difficult for each language. A second problem is that, for a
sound answer to the question, we need a machine translation system and a
machine readable dictionary that have exactly the same lexical coverage.
If the machine translation system misses vital translations that the machine
readable dictionary does list, we end up comparing the coverage of
the respective translation lexicons instead of the two approaches to
cross-language retrieval. Within the Twenty-One project we have a third, more
practical, problem that prevents us from evaluating the usefulness of the
translation system used (LOGOS) against the usefulness of the machine
readable dictionaries available within the project (Van Dale). The Van
Dale dictionaries are entirely based on Dutch head words, but translation
from and to Dutch is not supported by LOGOS. These considerations urge
us to rephrase the issue into a more manageable question.

^2 Precision is the fraction of the retrieved documents that is actually relevant and
recall is the fraction of the relevant documents that is actually retrieved.

A first, manageable, step in comparing document translation with 
query translation might be the following. What is, given a translation 
lexicon, the best approach for query translation: using one translation 
for each query term (i.e. explicit disambiguation) or using all possible 
translations? Picking one translation is a necessary condition of the doc- 
ument translation approach. For query translation we can either use one 
translation for searching, or more than one. The question one or more 
translations also reflects the classical precision / recall dilemma in infor- 
mation retrieval: picking one specific translation of each query term is a 
good strategy to achieve high precision; using all possible translations of 
each query term is a good strategy to achieve high recall. 
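To make the precision / recall trade-off concrete, the following minimal Python sketch computes both measures for one retrieved set, using the usual definitions (precision: fraction of retrieved documents that are relevant; recall: fraction of relevant documents that are retrieved). The document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four documents retrieved, three relevant in total.
print(precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"]))
```

Adding more translations per query term typically enlarges the retrieved set, which tends to raise recall and may lower precision.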

3 Methods for Query Translation 

As said in the previous section one of the issues dealt with in this paper is 
comparing cross-language information retrieval using one translation per 
query term with retrieval using more than one translation per query term. 
We will report the results of retrieval experiments using the Dutch queries 
on the English TREC cross-language task collection. A Dutch query will
be referred to as the source language query; the English query will be 
referred to as the translated query. The experiments can be divided into 
three categories: 

1. query translation using one translation per source language query 
term 







2. query translation using unstructured queries of all possible transla- 
tions per source language query term 

3. query translation using structured queries of all possible translations 
per source language query term 

3.1 Using one Translation per Query Term 

If only one translation per query term is used for searching, the transla- 
tion process must have some kind of explicit disambiguation procedure. 
This procedure might be based on an existing machine translation system, 
or alternatively, on statistical techniques or heuristics. After disambigua- 
tion, the translated query can be treated the way a query is normally 
treated in a monolingual setting. A ’normal’ monolingual setting in this
context is retrieval on the basis of a statistical ’bag-of-words’ model, e.g.
the vector space model [20] or the classical probabilistic model [18].
The unstructured queries mentioned in the next section will also refer to 
the use of a bag-of-words model. Instead of the vector space model or 
the classical probabilistic model we will use a new model, called the lin- 
guistically motivated probabilistic model of information retrieval, which 
is described in the appendix of this paper. 

Figure 1 gives an example of an English query {third, world} that is 
used to search a French collection. Although both third and world might 
have more than one possible translation, the system has to pick one of 
them. 



{third, world}
   |  dictionary lookup and disambiguation
   v
{tiers, monde}

Fig. 1. using one translation per query term



In section 4 a number of heuristics and statistics for disambiguation will 
be explored. As explained in section 2 we will not be able to actually use 
machine translation for disambiguation. It is however possible to define 
an upper bound on what is possible with the one-translation approach 
by asking a human expert to manually disambiguate the output of the 
machine readable dictionary. We hypothesise that query translation using 
a machine translation system with the same lexical coverage as the machine
readable dictionary will not result in better retrieval performance
than query translation using the manually disambiguated output of the 
same dictionary. 

3.2 Using Unstructured Queries 

If more than one translation per source language query term is used for
searching we might still treat the translated query as a bag-of-words. As
we will see in section 5 the way of weighting the possible translations is
crucial for unstructured queries. In particular it is important to normalise
the possible translations in such a way that for each source language
query term the weights of possible translations sum up to one. Not using
normalisation will make source language query terms with a lot of possible
translations unintentionally more important than source language query
terms that have fewer possible translations.

Figure 2 again gives the example of an English query {third, world}
that is used to search a French collection. It is assumed that the English
term third has two possible French translations: tiers and troisieme
and that the English term world has three possible translations: monde,
mondial and terre. Instead of selecting one translation we might use all
possible translations to search the document collection.



{third, world}
   |  dictionary lookup
   v
(tiers, troisieme, monde, mondial, terre)

Fig. 2. translation using an unstructured query



The result of figure 2 could be used directly for searching the French
collection (see run2a in section 5), but this would make the term world in
the source language query more important (because it has more possible
translations) than the word third. Normalisation of the possible translations
might therefore be used to make the contribution of third as high
as the contribution of world. In this case the possible translations of third
are reweighted to 0.5 and the possible translations of world to 0.33 (see
run2c in section 5). If one of the possible translations of one source
language query term is more probable than the other(s), this possible
translation might be weighted higher than the other(s) while keeping the
normalisation intact.
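The normalisation step described in this section can be sketched in a few lines of Python. The sketch assumes the dictionary lookup returns one raw weight per candidate translation (all equal to 1 here, as in the third / world example); the data structure and function name are illustrative assumptions.

```python
def normalise(lookup):
    """Rescale the weights of each source term's translations to sum to one."""
    normalised = {}
    for source_term, candidates in lookup.items():
        total = sum(candidates.values())
        normalised[source_term] = {
            translation: weight / total
            for translation, weight in candidates.items()
        }
    return normalised


# Unweighted dictionary output for the English query {third, world}.
lookup = {
    "third": {"tiers": 1, "troisieme": 1},
    "world": {"monde": 1, "mondial": 1, "terre": 1},
}

weights = normalise(lookup)
print(weights["third"])  # each translation of "third" gets weight 0.5
print(weights["world"])  # each translation of "world" gets weight 1/3
```

If the raw weights are unequal (for instance pseudo frequencies), the same function keeps their proportions while still making every source term contribute a total weight of one.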



3.3 Using Structured Queries 

If all possible translations are treated as one bag-of-words we ignore the 
fact that a document containing one possible translation of each source 
language query term is more likely to be relevant than a document con- 
taining all possible translations of only one source language query term. 
The boolean model or weighted boolean models (see e.g. [20]) can be used 
to retrieve only those documents that contain a translation of all or most 
of the source language query terms [9]. Disjunction can be used to com- 
bine possible translations of one source language query term. Conjunction 
can be used in a way that the translated query reflects the formulation 
of the source language query. 



{third, world}
   |  dictionary lookup
   v
{(tiers ∪ troisieme), (monde ∪ mondial ∪ terre)}

Fig. 3. translation using a structured query



Figure 3 again gives the example of an English query {third, world} on 
a French document collection. The structured query reflects the possible 
translations of the source language query terms in an intuitive way. The 
structured query weighting algorithm implicitly normalises the possible 
translations in a disjunction. Explicit normalisation as done for unstruc- 
tured queries is no longer necessary. Structured queries are generated 
automatically by the translation module and may take the probabilities 
of possible translations into account. Technical details of the algorithm 
can be found in the appendix. 
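As a rough illustration of the structure of such a query, the Python sketch below builds a conjunction of disjunctions from a dictionary lookup and applies a plain (unweighted) Boolean match. This is only a sketch of the query structure: the actual ranking in this paper is probabilistic and is described in the appendix, and the function names and toy documents are assumptions.

```python
def build_structured_query(source_terms, dictionary):
    """One disjunction (a set of translations) per source language term."""
    return [set(dictionary[term]) for term in source_terms]


def matches(document_tokens, structured_query):
    """True if the document contains at least one translation of
    every source language query term (conjunction of disjunctions)."""
    tokens = set(document_tokens)
    return all(tokens & disjunction for disjunction in structured_query)


dictionary = {
    "third": ["tiers", "troisieme"],
    "world": ["monde", "mondial", "terre"],
}
query = build_structured_query(["third", "world"], dictionary)

print(matches(["le", "tiers", "monde"], query))       # translations of both terms
print(matches(["troisieme", "tiers", "fois"], query))  # only translations of "third"
```

The second document contains both translations of third but no translation of world, so it is rejected, which is the behaviour an unstructured bag-of-words query cannot express.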

4 Heuristics and Statistics for Disambiguation 

This section lists a number of information resources that can be used to 
identify the proper translation or proper translations of a query term. 
The section briefly describes information that is explicitly or implicitly 
in the dictionary and information from other sources like parallel corpora
and the document collection itself. 

4.1 Dictionary Preferred Translation 

The VLIS lexical database of Van Dale Lexicography lists for each entry
explicitly one preferred translation which is considered the most commonly
used one. Replacing each query term with the preferred translation is a
simple, but possibly effective, approach to cross-language retrieval. 

4.2 Pseudo Frequencies 

The Van Dale database also contains explicit information on the sense
of possible translations. Some Dutch head words carry over to the same
English translation for different senses. For example the Dutch head word
jeugd may be translated to youth in three senses: the sense of ’characteristic’,
’time-frame’ and ’person’. The ’person’ sense has a synonym translation:
youngster. As youth occurs in the dictionary under three senses and
youngster under one sense, we assign youth a weight that is three times
as high as the weight for youngster. The assumption made by weighting
translations is that the number of occurrences in the dictionary may
serve as a rough estimate of actual frequencies in parallel corpora. In other
words: the number of occurrences in the dictionary serves as pseudo frequencies.
Ideally, if the domain is limited and parallel corpora on the
domain are available, weights should be estimated from actual data as
described in section 4.3.
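The pseudo frequency count for the jeugd example can be sketched as follows. The layout of the dictionary entry (a mapping from sense label to its translations) is an assumption made for illustration and does not reflect the actual VLIS format.

```python
from collections import Counter

# Hypothetical representation of the entry for the Dutch head word "jeugd":
# one list of English translations per sense.
jeugd_senses = {
    "characteristic": ["youth"],
    "time-frame": ["youth"],
    "person": ["youth", "youngster"],
}


def pseudo_frequencies(senses):
    """Count under how many senses each translation is listed."""
    counts = Counter()
    for translations in senses.values():
        counts.update(set(translations))  # count each translation once per sense
    return dict(counts)


print(pseudo_frequencies(jeugd_senses))
```

Here youth is counted three times and youngster once, giving the 3:1 weight ratio described in the text.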

4.3 Frequencies from Parallel Corpora 

The Twenty-One system contains documents on the domain of sustain- 
able development. Translation in Twenty-One is done using a general 
purpose dictionary (Van Dale) and a general purpose MT-system (LO- 
GOS), but these resources are not very well suited for domain-specific jar- 
gon. Domain-specific jargon and its translations are implicitly available 
in parallel corpora on sustainable development. Translation pairs can be 
derived from parallel corpora using statistical co-occurrence by so-called 
word alignment algorithms. Within the Twenty-One project word align- 
ment algorithms were developed that do the job in a fast and reliable way 
[6]. Domain specific translation lexicons were derived from Agenda 21, a 
UN-document on sustainable development that is available in most of the 
European languages including Dutch and English. 
For the experiment we merged the automatically derived dictionary
with the Van Dale dictionary in the following way. For each entry, we 
added the pseudo frequencies and the real frequencies of the possible 
translations. Pseudo frequencies are usually not higher than four or five, 
but the real frequencies in the parallel corpus may be more than a thou- 
sand for frequent translation pairs. Adding pseudo frequencies and real 
frequencies has the effect that for possible translations that are frequent 
in the corpus the real frequencies will be important, but for translations 
that are infrequent or missing the pseudo frequencies will be important. 

Translation pairs that have a frequency of one or two in the parallel 
corpus may be erroneously derived by the word alignment algorithm. If,
however, such an infrequent translation pair is also listed in the machine 
readable dictionary, then the pair was probably correct. Therefore we 
added a bonus frequency of three to each possible translation that is 
both in the corpus and in Van Dale. 
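The merging rule (add pseudo and real frequencies, plus a bonus of three for translations found in both resources) can be sketched in Python as follows. The corpus counts in the example are invented for illustration and are not taken from Agenda 21.

```python
BONUS = 3  # bonus frequency for translations listed in both resources


def merge_frequencies(pseudo, corpus):
    """Combine dictionary pseudo frequencies with parallel corpus frequencies
    for the possible translations of a single entry."""
    merged = {}
    for translation in set(pseudo) | set(corpus):
        frequency = pseudo.get(translation, 0) + corpus.get(translation, 0)
        if translation in pseudo and translation in corpus:
            frequency += BONUS
        merged[translation] = frequency
    return merged


# Hypothetical counts for one Dutch entry.
pseudo = {"youth": 3, "youngster": 1}
corpus = {"youth": 2, "young people": 40}
print(merge_frequencies(pseudo, corpus))
```

In this toy example youth ends up at 3 + 2 + 3 = 8: its low corpus frequency alone would be suspicious, but the dictionary confirms the pair.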

4.4 Context for Disambiguation 

The techniques introduced so far do not resemble techniques that are ac- 
tually used in machine translation systems. Traditionally, disambiguation 
in machine translation systems is based on (syntactic) context of words. 
In this section a statistical algorithm is introduced that uses context of the 
original query words to find the best translation. The algorithm uses candidate
noun phrases extracted from the document base to disambiguate
the noun phrases from the query. Noun phrases were extracted using the standard tools
as used in the Twenty-One system: the Xerox morphological tools and the 
TNO parser. The noun phrases were sorted and then counted, resulting in 
a list of unique phrases with frequency of occurrence. 

The introduction of noun phrases (or any multi-word expression) in 
the translation process leads to two types of ambiguity: sense ambiguity 
and structural ambiguity. Figure 4 gives an example of the French trans- 
lation chart of the English noun phrase third world war. Each word in this 
noun phrase can have several translations that are displayed in the bot- 
tom cells of the chart, the so-called sense ambiguity. According to a list of 
French noun phrases there may be two candidate multi-word translations: 
tiers monde for the English noun phrase third world and guerre mondiale 
for world war. These candidate translations are displayed in the upper 
cells of the chart. Because the internal structure of noun phrases was not 
available for the translation process, we can translate a full noun phrase 
by decomposing it in several ways. For example third world war can be
split up in the separate translation of either third world and war or in the
separate translation of third and world war. The most probable decomposition
can be found using techniques developed for stochastic grammars
(see e.g. [2]). The probabilities of the parse trees can be mapped into
probabilities, or weights, of possible translations. A more detailed description
of the algorithm can be found in [12].

                third           world           war
multi-word      tiers monde (third world)
                                guerre mondiale (world war)
single word     troisieme       monde           guerre
                tiers           mondiale        bataille
                                terre

Fig. 4. translation chart of third world war
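The structural ambiguity of the chart can be made explicit by enumerating every way of splitting the noun phrase into chunks that have at least one candidate translation. The Python sketch below does only this enumeration; it does not implement the stochastic grammar scoring of [12], and the lexicon layout is an assumption based on the entries of Figure 4.

```python
def decompositions(words, lexicon):
    """Yield every split of `words` into contiguous chunks found in `lexicon`."""
    if not words:
        yield []
        return
    for end in range(1, len(words) + 1):
        chunk = " ".join(words[:end])
        if chunk in lexicon:
            for rest in decompositions(words[end:], lexicon):
                yield [chunk] + rest


# Candidate translations as displayed in the translation chart.
lexicon = {
    "third": ["troisieme", "tiers"],
    "world": ["monde", "mondiale", "terre"],
    "war": ["guerre", "bataille"],
    "third world": ["tiers monde"],
    "world war": ["guerre mondiale"],
}

for split in decompositions(["third", "world", "war"], lexicon):
    print(split)
```

Each split would then be assigned a probability, and the sense ambiguity inside each chunk is resolved by choosing among that chunk's candidate translations.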

4.5 Manual Disambiguation 

The manual disambiguation of the dictionary output was done by a qualified
interpreter who was also a native speaker of English. She had access
to the Dutch version of the topics and to the English dictionary output
consisting of a number of possible translations per source language
(Dutch) query word. For each Dutch word, one of the possible English
translations had to be chosen, even if the correct translation was not one
of them.

4.6 Other Information

In the experiments described in this paper we ignored one important
source of information: the multi-word entries in the Van Dale dictionaries.
Multi-word expressions like for instance world war are explicitly listed in
the dictionary. For the experiments described in this paper we only used
word-by-word translations using the single word entries.

5 Experimental Setup and Results 

In section 3 we identified three methods for query translation: using one
translation per query term, using an unstructured query of all translations
per source language query term and using a structured query of all
translations per source language query term. Each method is assigned a
number 1, 2 or 3. In section 4 five sources of information were identified
that may be used by these methods: dictionary preference, pseudo frequencies,
parallel corpora, context in noun phrases and human expertise.
Given the five information sources we identified seven (two experiments
were done both with and without normalisation) basic retrieval experiments
or runs that are listed in table 1. Each experiment is labelled with
a letter from a to g.



Table 1. disambiguation methods

run name   technique to weight translations / pick the best translation
run?a      no weights used / dictionary preferred translation
run?b      weight by pseudo frequencies
run?c      normalise weights of possible translations (run?a)
run?d      weight by normalised pseudo frequencies
run?e      normalised ’real’ frequencies estimated from the parallel Agenda 21 corpus
run?f      weight by using noun phrases from documents (including normalisation)
run?g      disambiguation by a human expert



The combinations of seven run types and three methods define
a total number of 21 possible experiments. After removing combinations
that are redundant or not informative, 15 experiments remain.

In the remainder of this section we will report the results of 15 ex- 
periments on the TREC cross-language task test collection [3] topics 1-24 
(excluding the topics that were not judged at the time of TREC-6 leaving 
21 topics). The Dutch topics were used to search the English documents. 
Experiments were compared by means of their non-interpolated average 
precision, average precision in short. Additionally, the result of each ex- 
periment will be compared with the result of a monolingual base line run, 
which is the result of queries based on the English version of the TREC 
topics. The monolingual run performs at an average precision of 0.403. 
All experiments were done with the linguistically motivated experimental 
retrieval engine developed at the University of Twente. 
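For readers unfamiliar with the measure, the sketch below shows one common way to compute non-interpolated average precision for a single topic, and how the "relative to baseline" percentages in the result tables follow from the monolingual score of 0.403. It is a minimal illustration and not the evaluation software used for the experiments; the ranking in the example is hypothetical.

```python
def average_precision(ranking, relevant):
    """Non-interpolated average precision: the mean of the precision values
    measured at the rank of each relevant document, over all relevant documents."""
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def relative_to_baseline(score, baseline=0.403):
    """Express a score as a rounded percentage of the monolingual base line."""
    return round(100 * score / baseline)


# Hypothetical ranking with relevant documents at ranks 1 and 3.
print(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
print(relative_to_baseline(0.315))  # manually disambiguated run
```

The reported figures are averages of this per-topic value over the 21 judged topics.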

5.1 One Translation Runs 

Table 2 lists the results of the one translation runs. Normalisation of
translation weights is not useful for picking the best translation. Therefore the
table does not list run1c and run1d. (run1d would give exactly the
same results as run1b.)



Table 2. results of ’one-translation’ runs

run name   average precision   relative to baseline (%)
run1a      0.262               65
run1b      0.231               57
run1e      0.282               70
run1f      0.269               67
run1g      0.315               78


Not surprisingly, the manually disambiguated run outperforms the automatic
runs, but it still performs at only 78% of the monolingual run. Translation
ambiguity and missing terminology are the two primary sources of
cross-language retrieval error [10]. We hypothesise that the loss of performance
is due to missing terminology and possibly errors in the translation
scripts. The 78% performance relative to the monolingual base line is an upper
bound on what is possible using a one-translation approach on the TREC
cross-language collection.

The best automatic run is the run using corpus frequencies, run1e.
This is a surprise, because we used a relatively small corpus on the domain
of the Twenty-One demonstrator, which is sustainable development.
Inspection of the topics however shows that a lot of topics discuss
international problems like air pollution, combating AIDS, etc. which fall
directly in the domain of sustainable development.

The dictionary preferred run run1a performs reasonably well. The
run using context from noun phrases, run1f, performs only a little better.
Pseudo frequencies (run1b) are less useful in identifying the correct
translation.

5.2 Unstructured Query Runs 

Table 3 lists the results of the unstructured query runs using all possible
translations of each original query term. We experimented with all
information sources except for the human expert.

Table 3. results of ’unstructured query’ runs

run name   average precision   relative to baseline (%)
run2a      0.180               45
run2b      0.162               40
run2c      0.268               67
run2d      0.308               76
run2e      0.305               76
run2f      0.275               68

A first important thing to notice is that the normalisation of the term
weights is a prerequisite for good performance if all possible translations
per source language query term are used in an unstructured query. Not
using the normalisation, as done in run2a and run2b, will drop performance
to a disappointing 40 to 45 per cent of the monolingual base line.

More surprisingly, the pseudo frequency run run2d and the real frequency
run run2e now perform equally well and both approach the upper
bound on what is possible with the one translation approach (run1g).
Although the pseudo frequencies are not very useful for identifying the 
best translation, they seem to be as realistic as real frequencies if used 
for weighting the possible translations. 

5.3 Structured Query Runs 

Table 4 lists the results of the structured query runs. Normalisation of 
term weights is implicit in the structured query, so run3a and run3b
will give exactly the same results as run3c and run3d respectively.



Table 4. results of ’structured query’ runs

run name   average precision   relative to baseline (%)
run3c      0.311               77
run3d      0.330               82
run3e      0.335               83
run3f      0.323               80



The four runs do not differ as much in performance as their unstructured
equivalents, which suggests that the structured queries are more robust
than the unstructured queries. Again, the pseudo frequency run run3d
and the real frequency run run3e perform almost equally well. Three
out of four runs perform better than the manually disambiguated ’one
translation’ run run1g.

6 Conclusion 

This paper gives an overview of methods and information resources that 
can be used for cross-language information retrieval. Evaluation of these 
methods on the TREC cross-language collection indicates that using all 
possible translations for searching leads to better retrieval performance 
in terms of average precision than using just one (the best) translation. 

In several early publications on cross-language retrieval [10,11,16] it 
was hypothesised that the document translation approach to cross-language 
retrieval leads to better retrieval performance than the query translation 
approach because there is more context available in documents for lexi- 
cal disambiguation. Of course, lexical disambiguation is easier if there is 
more context available, but we claim that lexical disambiguation is not 
essential for good retrieval performance. In fact, table 4 shows that the 
best performing runs simply use all possible translations. The results of 
the manually disambiguated run suggest that not much can be gained by 
putting a lot of effort in explicit disambiguation of possible translations. If 
sophisticated search algorithms are used, disambiguation is done implic- 
itly during searching. This suggests that the hypothesis that document 
translation leads to better retrieval performance than query translation 
might not be true after all: further research is needed on this topic. 

The appendix of this paper describes some important steps in the 
development of new probabilistic retrieval models. It introduces a new 
method to rank documents using boolean structured queries and it intro- 
duces a new way to include statistical translation directly into statistical 
retrieval. In the cross-language retrieval experiments reported on here, 
boolean structured queries outperform the unstructured queries. In fu- 
ture publications we hope to show that this method, although it needs 
the Boolean queries to be in conjunctive normal form, is also useful in a 
monolingual setting with Boolean queries that are formulated directly by 
the user. 
Appendix: Probabilistic Weighting Algorithms

The weighting algorithms for structured and unstructured queries are 
based on the linguistically motivated probabilistic model of information 
retrieval [5,7,8]. This appendix gives an overview of the model and of its 
application to cross-language information retrieval. 

A.1 An Informal Description of the Underlying Ideas

The linguistically motivated probabilistic model is based on advances
made in the field of statistical natural language processing and uses probability
theory in quite a different way than the classical probabilistic
approaches to information retrieval, e.g. the well-known Robertson /
Sparck-Jones probabilistic model [18]. It uses a metaphor that is very similar
to ’urn models’ that are often used in introductory statistics courses
[14]. Instead of drawing balls at random with replacement from an urn, we
will consider the process of drawing words at random with replacement 
from a document. Suppose someone selects one document in the docu- 
ment collection; draws at random, one at a time, with replacement ten 
words from this document and hands those ten words (the query terms) 
over to the system. The system now can make an educated guess as to
which document the words came from, by calculating for each document
the probability that the ten words were sampled from it and by ranking 
the documents accordingly. The intuition behind it is that users have a 
reasonable idea of which terms are likely to occur in documents of interest 
and will choose query terms accordingly [17]. 

The model can be extended to Boolean queries by treating the sampling
process as an AND-query and allowing that each draw is specified by
a disjunction of more than one term. For example, the probability of first
drawing the term information and then drawing either the term retrieval
or the term filtering from a document can be calculated by the model
introduced in this paper without any additional modeling assumptions.

Furthermore it can be extended with additional statistical processes 
to model differences between the vocabulary of the query and the vo- 
cabulary of the documents. Statistical translation can be added to the 
process of sampling terms from a document by assuming that the trans- 
lation of a term does not depend on the document it was sampled from. 
Cross-language retrieval using e.g. Dutch queries on an English document 
collection uses the sampling metaphor as follows: first an English word is 
sampled from the document, and then this word is translated to Dutch 
with some probability that can be estimated from a parallel corpus. 

A.2 Definition of the Corresponding Probability Measures

Based on the ideas mentioned above, probability measures can be defined
to rank the documents given a query. The probability that an unstructured
query $T_1, T_2, \ldots, T_n$ of length $n$ is sampled from a document with
document identifier $D$ is defined by equation 1.

$$P(T_1, T_2, \ldots, T_n \mid D) = \prod_{i=1}^{n} \bigl(\alpha_1 P(T_i) + \alpha_2 P(T_i \mid D)\bigr) \qquad (1)$$

The probability measure is defined by a linear combination of global information
$P(T)$ on the terms and local information $P(T \mid D)$ on the terms.
The global information is added because some query terms do not occur
even once in the document of interest. It is assumed that these terms are
randomly selected from any of the documents in the entire collection. In
section A.4 it is shown that this probability measure can be rewritten to
a tf×idf term weighting algorithm. A somewhat similar probability function
was used by Miller, Leek and Schwartz [13]. They showed that it can
be interpreted as a two-state hidden Markov model in which $\alpha_1$ and $\alpha_2$
define the state transition probabilities and $P(T)$ and $P(T \mid D)$ define the
emission probabilities.

The extension for Boolean queries as mentioned above is straightforward.
For each draw, different terms are mutually exclusive. That is, if
one term is drawn from a document, the probability of drawing e.g. both
the term information and the term retrieval is 0. Following the axioms
of probability theory (see e.g. Mood [14]) the probability of a disjunction
of terms in one draw is the sum of the probabilities of drawing the single
terms. Disjunction of m possible translations $T_{ij}$ ($1 \le j \le m$) of the
source language query term on position i is defined as follows.

\[ P(T_{i1} \cup T_{i2} \cup \cdots \cup T_{im} \mid D) = \sum_{j=1}^{m} \bigl( \alpha_1 P(T_{ij}) + \alpha_2 P(T_{ij} \mid D) \bigr) \tag{2} \]

Following this line of reasoning, AND queries are interpreted in the same way as
unstructured queries defined by equation 1. Or, to put it differently,
unstructured queries are implicitly assumed to be AND queries.
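Equations 1 and 2 can be illustrated with a minimal Python sketch. The function names and the toy probability values below are hypothetical and chosen only for illustration; the value 0.15 for $\alpha_2$ is the one reported in section A.3.

```python
from math import prod

ALPHA_2 = 0.15           # value reported for the experiments (section A.3)
ALPHA_1 = 1.0 - ALPHA_2  # alpha_1 + alpha_2 = 1


def p_draw(terms, p_global, p_local):
    """Equation 2: probability of one draw specified by a disjunction of terms.

    A single-term draw is the special case of a one-element disjunction.
    """
    return sum(
        ALPHA_1 * p_global.get(t, 0.0) + ALPHA_2 * p_local.get(t, 0.0)
        for t in terms
    )


def p_query(draws, p_global, p_local):
    """Equation 1, with each factor generalised to a disjunction (AND of ORs)."""
    return prod(p_draw(terms, p_global, p_local) for terms in draws)


# Hypothetical toy estimates of P(T) and P(T|D) for one document D.
P_GLOBAL = {"information": 0.02, "retrieval": 0.01, "filtering": 0.005}
P_LOCAL = {"information": 0.10, "retrieval": 0.05}  # "filtering" does not occur in D

# information AND (retrieval OR filtering)
score = p_query([["information"], ["retrieval", "filtering"]], P_GLOBAL, P_LOCAL)
```

The term filtering still contributes to the second draw through its global probability, even though it does not occur in the document.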

Statistical translation is added to these probability measures by assuming
that the translation of a term does not depend on the document
it was drawn from. If $N_1, N_2, \ldots, N_n$ is a Dutch query of length
n and a Dutch term on position i has $m_i$ possible English translations
$T_{ij}$ ($1 \le j \le m_i$), then the ranking as structured queries would be done
according to equation 3.

\[ P(N_1, N_2, \ldots, N_n \mid D) = \prod_{i=1}^{n} \sum_{j=1}^{m_i} P(N_i \mid T_{ij}) \bigl( \alpha_1 P(T_{ij}) + \alpha_2 P(T_{ij} \mid D) \bigr) \tag{3} \]

The translation probabilities P{Ni\Tij) can be estimated from parallel 
corpora, or alternatively by any of the methods described in section 4. 
Equation 3 is the basis of the structured query runs run3c-f described 
in section 5.3. The experiments only differ in the way the translation 
probabilities are estimated, i.e. the disambiguation method that was used. 
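As a rough illustration of equation 3, the following Python sketch scores one document for a translated query. The Dutch terms, translation probabilities and term probabilities are hypothetical toy values, and the function name is an assumption of this sketch.

```python
from math import prod

ALPHA_2 = 0.15
ALPHA_1 = 1.0 - ALPHA_2


def p_structured_query(query, translations, p_global, p_local):
    """Equation 3: product over query terms of a translation-weighted sum."""
    return prod(
        sum(
            p_trans * (ALPHA_1 * p_global.get(t, 0.0) + ALPHA_2 * p_local.get(t, 0.0))
            for t, p_trans in translations[n].items()
        )
        for n in query
    )


# Hypothetical translation probabilities P(N_i | T_ij), keyed by Dutch term.
TRANSLATIONS = {
    "informatie": {"information": 1.0},
    "zoeken": {"retrieval": 0.6, "search": 0.4},
}
# Hypothetical estimates of P(T) and P(T|D) for one English document D.
P_GLOBAL = {"information": 0.02, "retrieval": 0.01, "search": 0.03}
P_LOCAL = {"information": 0.10, "retrieval": 0.05}

score = p_structured_query(["informatie", "zoeken"], TRANSLATIONS, P_GLOBAL, P_LOCAL)
```

Only the dictionary of translation probabilities changes between disambiguation methods; the scoring function stays the same.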

For the unstructured query runs run2a-f statistical translation was
added by making the number of times a query term occurs in equation
1 proportional to the translation probabilities. For run2a and run2b
translation frequencies instead of translation probabilities were used. The
translation frequencies or probabilities can be multiplied with the query
weights of table 5 (see section A.4). Again, the experiments only differ in
the way the translation probabilities were estimated.

A.3 Parameter Estimation

In information retrieval it is good practice to use the term frequency
and document frequency as the main components of term weighting
algorithms. Our probabilistic model does not make an exception. The term
frequency tf(t, d) is defined by the number of times the term t occurs in
the document d. The document frequency df(t) is defined by the number
of documents in which the term t occurs. Estimation of P(T) and P(T|D)
in equations 1, 2 and 3 was done as follows:

\[ P(T_i = t) = \frac{df(t)}{\sum_{t'} df(t')} \qquad P(T_i = t \mid D = d) = \frac{tf(t, d)}{\sum_{t'} tf(t', d)} \]

The value of the unknown parameter $\alpha_2$ was determined by previous
experiments on three different collections including the TREC cross-language
collection [8]. For the experiments described in this paper we used
$\alpha_2 = 0.15$. The value of $\alpha_1$ is determined by the fact that
$\alpha_1 + \alpha_2 = 1$.

A.4 Rewriting to Presence Weighting Algorithm

Similar to the probabilistic model of Robertson and Sparck-Jones [18]
probability measures for ranking documents can be rewritten into a format
that is easy to implement. A presence weighting scheme (as opposed
to a presence/absence weighting scheme) assigns a zero weight to terms
that are not present in a document. Presence weighting schemes can be
implemented using the vector product formula. This section presents the
resulting algorithms. Rewriting equation 1 results in the formula displayed
in table 5. It can be interpreted as a tf×idf weighting algorithm with
document length normalisation as defined by Salton and Buckley [19].
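The rewriting step can be sketched as follows. This is one possible derivation consistent with equation 1, not necessarily the authors' exact one, and it assumes $\alpha_1 > 0$ and $P(T_i) > 0$ for every query term. Dividing by a factor that does not depend on the document and taking logarithms both preserve the ranking:

```latex
\begin{align*}
P(T_1, \ldots, T_n \mid D)
  &= \prod_{i=1}^{n} \bigl( \alpha_1 P(T_i) + \alpha_2 P(T_i \mid D) \bigr) \\
  &= \Bigl( \prod_{i=1}^{n} \alpha_1 P(T_i) \Bigr)
     \prod_{i=1}^{n} \Bigl( 1 + \frac{\alpha_2 P(T_i \mid D)}{\alpha_1 P(T_i)} \Bigr)
\end{align*}
% The first product is the same for every document, so ranking by
% P(T_1, ..., T_n | D) is equivalent to ranking by:
\[
\sum_{i=1}^{n} \log \Bigl( 1 + \frac{\alpha_2 P(T_i \mid D)}{\alpha_1 P(T_i)} \Bigr)
\]
```

If $P(T_i \mid D)$ is estimated as zero for a term that does not occur in D, that term contributes $\log 1 = 0$ to the sum, which is the defining property of a presence weighting scheme.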




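vector product formula:   similarity(Q, D) = \sum_{k} w_{qk} \cdot w_{dk}
query term weight:        w_{qk} = tf(k, q)
document term weight:     w_{dk} = \log\Bigl( 1 + \frac{tf(k, d)}{df(k)} \cdot \frac{\alpha_2 \sum_{t} df(t)}{\alpha_1 \sum_{t} tf(t, d)} \Bigr)

Fig. 5. tf×idf term weighting algorithm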



If a structured query is used, the disjunction of possible translations
as defined by equation 2 should be calculated first. As addition is
associative, we do not have to calculate each probability separately before
adding them. Instead, the document frequencies and the term frequencies
of the disjuncts, respectively, can be added beforehand. The added
frequencies can be used to replace df(k) and tf(k, d) in the weighting formula of
table 5. The resulting ranking algorithm for Boolean queries was introduced
earlier by Harman [4] for on-line stemming. Harman did not present
her algorithm as an extension of Boolean searching, but instead called it
'grouping'. A somewhat similar approach for cross-language information
retrieval was adopted by Ballesteros and Croft [1] by using a 'synonym
operator' on term translations with more than one target term equivalent.
The operator treats occurrences of all words within it as occurrences of a
single pseudo term whose document frequency is the sum of the df's for each
word in the operator.

If translation probabilities are available, the adding of respectively the 
document frequencies and the term frequencies of the disjuncts should be 
done as a weighted sum with the translation probabilities as weights. 
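The adding of frequencies described above can be sketched in Python as follows. The function name and the toy frequencies are hypothetical; the sketch only shows the plain and the weighted sum, not the full weighting formula.

```python
def combined_frequencies(disjuncts, df, tf_in_doc, weights=None):
    """Add the df and tf values of the terms of one disjunction.

    weights: optional {term: translation probability}; when omitted, every
    disjunct counts fully (plain sum, as for a 'synonym operator').
    """
    if weights is None:
        weights = {t: 1.0 for t in disjuncts}
    df_sum = sum(weights[t] * df.get(t, 0) for t in disjuncts)
    tf_sum = sum(weights[t] * tf_in_doc.get(t, 0) for t in disjuncts)
    return df_sum, tf_sum


# Hypothetical frequencies for two possible translations of one query term.
DF = {"retrieval": 40, "search": 100}
TF_IN_DOC = {"retrieval": 3}  # "search" does not occur in this document

plain = combined_frequencies(["retrieval", "search"], DF, TF_IN_DOC)
weighted = combined_frequencies(
    ["retrieval", "search"], DF, TF_IN_DOC, {"retrieval": 0.5, "search": 0.5}
)
```

The two returned values would then stand in for df(k) and tf(k, d) when the document term weight is computed.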

Abstract. In this paper, we describe a crosslingual Information Retrieval System (IRS),
which makes the interrogation of multilingual databases possible. The CEA needs to
process a large number of multilingual databases and documents, and so we
had to adapt the IRS we use, SPIRIT (which relies on linguistic and statistical processing),
to this situation. We have thus set up a crosslingual interrogation system based on the
indexing of documents containing parts written in different languages and on the bilingual
reformulation of the query. The latter tries all the possible translations for every significant
word of the query, and the documents are used as filters in case of uncertainty or ambiguity.
The answers to a query are given in the form of a list of classes of documents ranked
according to their relevance. This paper describes the application of these techniques to
crosslingual access to catalogs and bibliographic databases.



1 Introduction 

In countries where English is not the sole official language, the scientific information
managed by public and private bodies is conveyed both in English and in the vernacular.
Other languages are of course used as well (in our organization, for instance, German,
Japanese, Russian or Spanish), but this is a marginal phenomenon
in comparison with the use of English and French.

Even among the publications issued by our scientists, which constitute a part of our
organization’s scientific memory, the number of English documents is large.

A large number of our databases contain both English and French documents: book 
catalogs of our libraries, a catalog of articles from about 3,000 scientific journals, a 
bibliographic database of international reports on energy and a bibliographic database 
of the publications issued by our scientists. 

In addition to this, it is worth mentioning the databases which are built up from Web 
pages, generally in several languages, extracted for a technology watch on a particular 
theme. But this is not the subject of this paper. 

For the majority of users who are likely to connect to our Intranet, information
retrieval systems that search on character strings are difficult to use
effectively. This is why access to information in our organization is based
upon a sophisticated information retrieval system using morphosyntactic parsing to
process both documents and queries. A statistical model is also used to weight
normalized words. As a result, the answer to a query is a list of documents ranked
according to their relevance, grouped in classes characterized by the same concept
intersection with the query.

S. Abiteboul, A.-M. Vercoustre (Eds.): ECDL’99, LNCS 1696, pp. 294-310, 1999
© Springer-Verlag Berlin Heidelberg 1999

This works well for monolingual databases. In the case of bilingual information, 
however, as automatic linguistic processing produces normalized words thanks to the 
dictionaries and syntactic description of a given language (L), a simple application of 
this technique is not straightforward. Indeed, the part of the database which is in 
language L is analyzed correctly and thus well indexed. In the remaining part 
however, the words in the other language are not normalized, except for words that 
seem to belong to the main language (L). In the last case, the result of normalization 
is incorrect. For example, if the database is indexed in French, the word “data” will be 
considered as the word "dater" that means "to date". One can easily imagine what 
kind of strange inferences can be drawn after an automatic expansion of the query. 

Fortunately those cases are not the most frequent ones. Yet, as a whole, search quality 
has not reached the level that could be expected from a sophisticated natural language 
information retrieval system. 

This is why, in 1990, we initiated the first European crosslingual information
retrieval project within the framework of ESPRIT: EMIR (European
Multilingual Information Retrieval). Part of the results have been introduced into the
SPIRIT system, a product marketed by T-GID that we use for all our applications.

The project demonstrated the feasibility of crosslingual interrogation using a
bilingual reformulation that tries all possible translations of each query word. In the
project, the database queried was monolingual. This is not the case in real life, and
so, since the end of the project, we have been developing technologies to
access mixed language databases and improve the quality of bilingual reformulation.



2 Description of the Data to Be Accessed 

We limit this presentation to the problem of data retrieval from catalogs and
bibliographic databases. We are also involved in the crosslingual
interrogation of fulltext databases, both in the case of project memory (all the texts
of interest to a consortium within the framework of a scientific project) and in the case
of scientific, technological or strategic watch.

Using catalogs to search for information on a given subject is a much more
difficult problem than retrieving a known title, because very little information is
available (titles and, in some cases, controlled or free keywords and summaries). In
contrast, fulltext contains a large number of redundancies and thus offers more
chances of finding the relevant documents.

We will focus our presentation on two different cases. One is the catalogs of our
libraries. The second is a bibliographic database of the publications issued by our
scientists.

The catalog items are filled using the OCLC database. Some local information is 
added. The proportion of French books is 12 %, English books 75 %, Russian books 
3.3 %, and German books 2.3 %. The rest is negligible. 

The main problem to be tackled is the following: even though our source is the OCLC
catalog, the content description is not in a single language. Titles are in the original
language (mainly French or English), but some keywords are in a language other
than the original one, for instance English keywords in the case of French books.
Keywords in French are assigned by the Library of France, which chooses them from
the Rameau Thesaurus. Of course, English and French keywords are not translations
of one another.

Example : 

Title : Automating library procedures, a survivor’s handbook 

Author : Ian Lovecy 

English keywords : Libraries, Automation, Library science. Data processing. Library, 
Administration 

French keywords : Bibliotheques, Informatique, TRAITEMENT DE 
L’INFORMATION 



The bibliographic database of CEA scientists’ publications contains original titles. 
The database content is keyed in directly by researchers' scientific departments. They 
can add the title in the other language (French if the document is in English or English 
if the document is in French), a summary in French and/or English and free keywords 
in French and/or English. 

When summaries or keywords are in both languages, they are not always translations
of one another.

Given the following two facts:

- the bilingual parts of the description do not always result from a translation,
- linguistic processing is not a technology that gives 100% correct results,

it is preferable to make use of all the information available for document access. As a
result, if the system fails to find a document using French information, the same
document can be retrieved using English. That is the reason why we have decided
to build an architecture that can exploit the full information available in each
document description so as to compare it with the query, generally expressed in one
language. The semantic intersection between the query and documents is computed
taking into account both languages.

Referring to both languages at the same time ensures an efficient ranking of mixed 
language documents. 



3 Crosslingual Approach of the EMIR Project 



3.1 Short State of the Art 

There are three main approaches to crosslingual interrogation of text databases. 

The first one is based upon a statistical approach assuming that a number of
documents of the databases are in both languages. In some cases documents do not
necessarily have to be translations of one another, provided that they are identified
as being related to the same topic or event, as is the case with newspaper articles.

The best-known statistical model is the vector space model of Gerard Salton [2].
In this model the database is represented as a vector space which has as many
dimensions as the number of different words in the database. Documents and queries
are vectors in this space. The proximity between a document and a query is evaluated
by computing the cosine of the angle between the two vectors.
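The cosine measure can be sketched in a few lines of Python. The sketch assumes raw word counts as vector coordinates, which is a simplification; the function name is hypothetical.

```python
import math
from collections import Counter


def cosine(doc_words, query_words):
    """Cosine of the angle between two bag-of-words count vectors."""
    d, q = Counter(doc_words), Counter(query_words)
    dot = sum(d[w] * q[w] for w in d.keys() & q.keys())
    norm = math.sqrt(sum(v * v for v in d.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


similarity = cosine(
    "radioactive waste management".split(), "waste management".split()
)
```

A document that shares no word with the query gets a cosine of 0, which is why a purely monolingual vector space cannot match a French query against English documents.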

An improved model is latent semantic indexing (LSI) [3]. By reducing the number of
dimensions to a few hundred, it achieves a sort of implicit reformulation both within
each language and between the languages involved.

The results are good but presuppose that translated documents exist.

The second approach is based on the translation either of the queries or of the full
database (generally the query) using a machine translation (MT) system. The
drawback of such an approach is that MT systems provide one
translation for each word: if they give the wrong translation (as is often the case for
polysemic words), the query result may be bad.

The third approach, which is ours, consists in using a bilingual reformulation that tries
every possible translation for each word. If there is an answer to the query, the
ambiguities are resolved by using the relevant documents as a semantic filter to choose the
right translation. Of course, this is only the case if the query contains several words.

3.2 SPIRIT’s Basic Principles

The EMIR (European Multilingual Information Retrieval) project was based on the 
technology of an existing product, the SPIRIT System. SPIRIT stands for "Syntactic 
and Probabilistic Indexing and Retrieval of Information in Texts". 

SPIRIT components are the following: 

Fig. 1. SPIRIT Components [diagram with the boxes: Query in natural language;
Documents; Representation of the query; Data storage; Comparison; Answers
grouped in relevance-ranked classes]



The linguistic processing consists of a morphosyntactic parsing that assigns a part of
speech to each word, recognizes idiomatic expressions using a dictionary, normalizes
words through lemmatization, recognizes dependency relations (especially those
involved in compounds), identifies general language synonyms (e.g. Uranium 235,
U235, U 235, U5) and solves some homographic problems (especially those that can
be solved by syntactic parsing, like "train" as verb and noun). At the end, empty words
are rejected according to their part of speech (prepositions, articles, punctuation, etc.).

The statistical model is used to give the user a list of documents sorted according to
their degree of relevance. The SPIRIT model differs from the vector space model
because it assigns a weight to each word of the database according to its
discriminating power, but does not assign a weight to each word in each document.

This point is important because the aim of the system is to find relevant information 
even in documents in which the main content is out of scope of the query. 

The SPIRIT system can be considered as a weighted Boolean system. This means that
documents are grouped into classes characterized by the same concept intersection
with the query. The classes constitute a partition of the database, which implies that
each document is placed in the best class it can belong to. Each class of intersection is
characterized by a Boolean query where only "AND" and "AND within a dependency
relation" are used as operators. Of course the second operator is considered better than the
"AND" one. In the following example, documents of the first class can also be
accessed by the queries characterizing the following classes. But a document is only
put in the best class (that is, the first one in decreasing order of relevance) it
can be in.

Example:

Question: management of radioactive wastes
Result:

First class: 166 documents, characterized by "management of radioactive wastes"
Second class: 62 documents, characterized by "management of wastes and radioactive"
Third class: 4 documents, characterized by "radioactive wastes and management"
Fourth class: 44 documents, characterized by "management of wastes"
Fifth class: 356 documents, characterized by "radioactive wastes"
etc.

Reformulation is used to infer from the original query words other words expressing 
the same concepts. Reformulation can be in the same language (synonyms, 
hyponyms, etc...) or in a different language (bilingual dictionary). 

The comparison tool is used to perform a quick evaluation of all possible intersections
between documents and query words (including those inferred through
reformulation), and to compute a relevance weight for each document. For
information retrieval the class weight depends only on the query/document
intersection. The weight of each concept (that is, all words inferred from one
original query word) depends only on the number of documents in which it occurs.

This approach differs from a standard query expansion because the intersection is
based on the original query words and not on the set of words inferred from the query.
That is why we regard it as a concept intersection instead of a word intersection.
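The grouping of documents into classes of concept intersection can be sketched as follows. This is a minimal illustration under stated assumptions, not SPIRIT's actual algorithm: dependency relations are ignored, and the concept weight log(N / n) is an assumed stand-in for a weight that depends only on the number n of documents containing the concept. All names and data are hypothetical.

```python
import math
from collections import defaultdict


def rank_classes(concepts, documents):
    """Group documents by the set of query concepts they contain.

    concepts:  {original query word: set of words inferred from it}
    documents: {doc id: set of normalized words}
    Returns a list of (concept set, doc ids, class weight), best class first.
    """
    matched = {
        doc_id: frozenset(c for c, variants in concepts.items() if variants & words)
        for doc_id, words in documents.items()
    }
    n_docs = len(documents)
    # Assumed weighting: log(N / n), n = number of documents containing the concept.
    doc_count = {c: sum(c in m for m in matched.values()) for c in concepts}
    weight = {c: math.log(n_docs / n) if n else 0.0 for c, n in doc_count.items()}

    classes = defaultdict(list)
    for doc_id, concept_set in matched.items():
        if concept_set:  # each document lands in exactly one class
            classes[concept_set].append(doc_id)
    ranked = [
        (set(cs), sorted(ids), sum(weight[c] for c in cs))
        for cs, ids in classes.items()
    ]
    return sorted(ranked, key=lambda item: item[2], reverse=True)


CONCEPTS = {
    "management": {"management", "gestion"},
    "radioactive": {"radioactive", "nuclear"},
    "wastes": {"waste"},
}
DOCUMENTS = {
    1: {"management", "nuclear", "waste"},
    2: {"waste", "radioactive"},
    3: {"waste"},
    4: {"reactor"},
}
ranked = rank_classes(CONCEPTS, DOCUMENTS)
```

Document 1 matches the concept "radioactive" through the inferred word "nuclear", yet its class is still described by the original query word, which is the point of a concept intersection.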

3.3 General Principles of Crosslingual Interrogation of SPIRIT Using Bilingual
Dictionaries

All possible translations are inferred from the original query words. Some of these
translations are not in the database and are naturally dropped. The rest are
filtered by the database itself as their co-occurrences are taken into account.

In fact, if a document is relevant to the query, the intersection between the concepts
expressed by the query words and those expressed in the document will be broad. In
this case the co-occurrence of translations of as many different query words as possible
makes it almost certain that the right translations of the query words have been found.

Example : 

If the query contains “free space”, where “free” is tagged as an adjective and “space”
as a noun by the syntactic disambiguation:

The possible translations of "free" (adjective) into French are: gratuit, large, libre. 

The possible translations of "space" (noun) into French are: blanc, espace, place, 
periode. 

The only translations of “free space” that can be found in the text database are 
“espace libre” or “place libre” which are both acceptable translations. 

Of course, the dependency relations are taken into account. Indeed, a two-word
compound in a dependency relation can generally have more than one hundred
translation candidates, using word-for-word translation and a reordering of the
translated words, but only a few of them can occur in the database with the same
dependency relation. More details on EMIR’s crosslingual
interrogation can be found in [4].
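The filtering of the “free space” translations can be sketched as follows. The sketch only checks co-occurrence within one document and ignores dependency relations, so it is a simplification of what is described above; the function name and the toy documents are hypothetical.

```python
from itertools import product


def cooccurring_translations(candidates, documents):
    """Keep the translation combinations whose words all occur in one document.

    candidates: list of candidate-translation lists, one per query word
    documents:  iterable of sets of normalized words
    """
    docs = [set(d) for d in documents]
    return [
        combo
        for combo in product(*candidates)
        if any(set(combo) <= d for d in docs)
    ]


FREE = ["gratuit", "large", "libre"]             # translations of "free" (adjective)
SPACE = ["blanc", "espace", "place", "periode"]  # translations of "space" (noun)
# Hypothetical database content, one set of normalized words per document.
DOCUMENTS = [
    {"espace", "libre", "reacteur"},
    {"place", "libre"},
    {"gratuit", "logiciel"},
    {"blanc"},
]
kept = cooccurring_translations([FREE, SPACE], DOCUMENTS)
```

Of the twelve word-for-word combinations, only the two that co-occur in some document survive.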

[Figure: successive filtering of the translations of the phrase "spectrometre a
temps de vol". In the original figure, vertical links represent dependency relations.]

Phrase         Bilingual reformulation       Filtering by database lexicon   Dependency relations recognition
spectrometre   spectrometer                  spectrometer                    spectrometer
temps          time, stroke, beat, stage,    time, stroke, beat, stage,      time
               tense, weather                weather
vol            flight, flock, theft          flight                          flight

This schema shows the different steps that are used to infer the right translations using
the database as a semantic filter. At the first stage, the system infers all possible
translations of all single words or compounds (if there is a global reformulation for a
compound). In our example there is no reformulation of compounds because they can
be translated word for word.



After that, translations that do not appear in the database are eliminated. For example, 
"theft" and "flock" for "vol", "tense" for "temps". 

To find the right translation of compounds translated word for word, it is necessary to
use transformational rules to reorder the words in the target language where needed. This
means, for example, that the system tries to find both "time of flight" and "flight time"
as possible translations of "temps de vol".
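The reordering step can be illustrated with a small Python sketch. The two reordering patterns below ("head of modifier" and "modifier head") are assumptions chosen to reproduce the "temps de vol" example; they are not SPIRIT's actual rule set, and the function names are hypothetical.

```python
from itertools import product


def compound_candidates(head_translations, modifier_translations):
    """Candidate English renderings of a French 'head de modifier' compound.

    Applies two illustrative (assumed) reordering patterns to every pair of
    word-for-word translations.
    """
    candidates = []
    for head, modifier in product(head_translations, modifier_translations):
        candidates.append(f"{head} of {modifier}")
        candidates.append(f"{modifier} {head}")
    return candidates


def attested(candidates, database_phrases):
    """Keep only the candidates that actually occur in the database."""
    return [c for c in candidates if c in database_phrases]


# "temps de vol": translations left after filtering by the database lexicon.
candidates = compound_candidates(["time", "weather"], ["flight"])
found = attested(candidates, {"time of flight", "flight time"})
```

The number of candidates grows with the product of the translation counts, which is why filtering against the database matters.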

The last filtering step is done by examining the translations appearing in the "best"
documents, that is, the documents having the greatest intersection with the query.
The "best" class of intersection is characterized by the words in dependency relation
"spectrometre-temps-vol". Examining the content of documents written in English, it
can be seen that the only translation found is "time of flight spectrometer".

Two more steps are needed. They have been tested in a prototype but are not yet
included in the commercial product that is used in our applications and for this test.

The first one consists of a feedback step that eliminates the translations that are not in
the best documents. In our example, “stage, stroke, weather, beat” can be eliminated as
translations of "temps".

The second one is the use of a second reformulation step using monolingual target
lexical knowledge. In our case, this reformulation can infer, for example,
"spectrometry" from "spectrometer". This will give a better answer to the query.



4 Accessing Mixed Language Databases 

The architecture developed as part of the EMIR project is not sufficient to access
mixed language databases. The latest version of SPIRIT is now able to process mixed
language databases, provided that the information in different languages is kept in
different fields. With this version, each language is correctly indexed. But it is
necessary to enter separate queries in both languages and merge the results manually.

Because the preceding procedure is “user unfriendly”, we have developed a special
crosslingual interrogation system that can give mixed language results.

A query in one language is used to retrieve data from the part of the database which is 
in the same language. In parallel, the same query can be used to access the part of the 
database which is in the other language. This is performed through the crosslingual 
tools of the system. 



Fig. 2. Crosslingual Interrogation 




A major point is that the system keeps the link between inferred words and the
original query words. Consequently, it is possible to merge the results of the two
queries.

Each class of documents is characterized by a particular intersection with the query.
Because of the preceding remark, this intersection can be expressed with the original
query words even when querying in the other language. It is thus easy to merge the
classes of intersections between the two results. In some cases (for a document with
more than one language), some classes disappear and some others are created.



For example:

With a query like “gestion des dechets nucleaires” (management of nuclear wastes),
the intersection with the English part of the document can be “dechets nucleaires”
(nuclear wastes) and with the French part “gestion des dechets” (management of
wastes). By merging the results, a new class, characterized by “gestion des dechets
nucleaires”, can appear in which to place this document, and some classes that
previously contained this document can disappear.
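Because each intersection is expressed with the original query words, merging the two results reduces to a set union per document. A minimal sketch, with hypothetical names and data, and leaving aside dependency relations and weights:

```python
def merge_intersections(french_hits, english_hits):
    """Merge per-language intersections into one concept set per document.

    Each argument maps a document id to the set of ORIGINAL query words whose
    concept was found in that language's part of the document.
    """
    merged = {}
    for doc_id in french_hits.keys() | english_hits.keys():
        merged[doc_id] = french_hits.get(doc_id, set()) | english_hits.get(doc_id, set())
    return merged


# Query: "gestion des dechets nucleaires"
FRENCH_PART = {"doc1": {"gestion", "dechets"}}
ENGLISH_PART = {"doc1": {"dechets", "nucleaires"}, "doc2": {"dechets"}}
merged = merge_intersections(FRENCH_PART, ENGLISH_PART)
```

Here doc1 moves into a new class covering all three query words, while the two partial classes it belonged to lose it.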

If in the merging process the first problem to be solved is the computation of new
classes of intersection, another problem lies in the weighting of words. As weights
are calculated for each language, we have to compute a new weight corresponding to
the occurrence of the same concept in different languages.

In the case of SPIRIT, the weight of words is linked to the number of documents they
appear in.

It is not realistic to aim at a global weighting of the bilingual database since there is
no link between words in context in a given language and their correct translation in
another language during the indexing process. Only in the context of a query can the
ambiguity be resolved.

The solution we have chosen for this problem of weight recomputation is to consider
the full answer to the query as a database. In this case it is possible with the same
algorithm to merge the classes and recompute a common weight, because we have
information on the number of documents in which a concept occurs in its different
linguistic forms.

The experiments done show that this palliative solution gives good results. 

Example of use on the CEA's publication database 

Let us give an example based upon Intranet querying of the bibliographic database of 
publications issued by CEA’s scientists. 

Query: La conservation des aliments par irradiation (food preservation by irradiation).

Result (in the class definition, the comma stands for the AND operator and the
hyphen means a dependency relation between two words appearing in both the query
and the documents):

1st class: conservation-aliments, irradiation, 2 documents

- Food preservation by irradiation: a brief introduction

- Le traitement ionisant des aliments

2nd class: aliments, irradiation, 2 documents

- Devenir de l’ionisation des aliments

- Flux des radionucleides dans les productions agricoles suite a un accident nucleaire

The first document of the first class contains : 

In French : conservation des aliments, 

In English : food preservation, irradiation, radiations 

Remark: "conservation des aliments" has inferred "food preservation" by a
word-for-word translation followed by a reordering of the words according to
transformational rules.

Irradiation has inferred “irradiation” and “radiation”. 

But the dependency relation between "conservation" and "irradiation" has not been 
taken into account in the French parsing, so the relation between "preservation" and 
"irradiation" has not been recognized in the English part of the document. 

The second document of the first class contains : 

In French : conservation des aliments, irradiation, rayonnement 

In English : radiation 

Remark : the word "rayonnement" has been inferred by the French monolingual 
reformulation. 

The first document of the second class contains : 

In French : aliments, alimentaires 

In English : food irradiation 

Remark: because the relation between “aliments” and “irradiation” has not been
recognized in the French query, the compound “food irradiation” has not been
recognized as such. Consequently, this document, which should have been put into a
class characterized by "irradiation des aliments", is in fact inserted into a class
characterized only by the co-occurrence of "aliments" and "irradiation", which contains
documents on the contamination of food by radionuclides.

In the last case a relevant document has been put into a class characterized by:
aliment.

This is due to the fact that the document contains "aliments ionises" in French and
"irradiated food" in English. French reformulation has not inferred "ioniser" from the
word "irradiation". As mentioned above, the commercial system only uses a one-step
reformulation at the present time. As a result, even if the relation between
"irradiation" and "irradiate" is stored in a monolingual English reformulation rule, it
is not possible to activate it.

Querying in English gives approximately the same results. Yet, the first document is
characterized by the full compound that has been handled appropriately in English.



5 Problems Limiting the Quality of the Results 

There are a large number of problems that limit the quality of crosslingual
interrogation. Among them are the following:

- Idiomatic expressions that are noncontiguous are not taken into account in the
commercial product (verbs with particles in English and German, or idiomatic verbs
in French). Examples: switch on (EN), abfahren (GE), mettre en oeuvre (FR).

- Some compounds of more than 2 parts are not well analyzed in the commercial
product. As a result, compounds are not recognized in the same way in the
different languages, so the ranking of relevant documents is not the same
according to the query language (see example in §4).

- The commercial product lacks a multistep reformulation performing a
monolingual target reformulation after the bilingual one. For example, to obtain
"couronne en or" from "golden crown" it is necessary to infer "dore" from
"golden" in the bilingual reformulation step, then "or" from "dore" by a target
language monolingual reformulation.
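The multistep reformulation of the last item can be sketched as a bilingual lookup followed by one monolingual lookup in the target language. The dictionaries below are hypothetical one-entry examples and the function name is an assumption of this sketch.

```python
def reformulate(word, bilingual, monolingual_target):
    """Bilingual step followed by one monolingual step in the target language."""
    first = set(bilingual.get(word, ()))
    second = {w2 for w in first for w2 in monolingual_target.get(w, ())}
    return first | second


BILINGUAL_EN_FR = {"golden": ["dore"]}  # bilingual reformulation rule
MONOLINGUAL_FR = {"dore": ["or"]}       # target-language reformulation rule

two_step = reformulate("golden", BILINGUAL_EN_FR, MONOLINGUAL_FR)
one_step = reformulate("golden", BILINGUAL_EN_FR, {})
```

With only the first step, "or" is never inferred, so "couronne en or" cannot be matched from "golden crown".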

Yet, the main problems are connected with the coverage and consistency of the
linguistic (monolingual and bilingual) dictionaries that are used:

- inconsistency of coding between monolingual dictionaries and bilingual ones
- lack of entries
- lack of translations for an existing entry
- lack of compounds that cannot be translated word for word
- lack of symmetry of bilingual dictionaries

This is why we have focused our efforts on quality control and vocabulary coverage
within our scope, in order to improve the linguistic resources of our system.

More details can be found in [5]. 



6 Different Ways to Enhance the Quality of the Linguistic 
Resources 

To enhance the quality of our linguistic resources we have developed programs that
check the integrity of the linguistic resources. This means:

- verifying that a source word of a bilingual dictionary is a normalized word
produced by the monolingual dictionary,

- verifying that target words of a bilingual dictionary are normalized words
produced by the monolingual dictionary,

- verifying that each compound is an idiomatic expression or a compound
recognized through syntactic parsing, because the normalization is different
("chemin de fer" is an idiomatic normalized expression and the compound
"chemin traverse" is the normalized form of "chemins de traverse").

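The first two checks listed above could look roughly like the following Python sketch. It assumes, purely for illustration, that the normalized forms produced by each monolingual dictionary are available as sets; the function name and data layout are hypothetical.

```python
def check_bilingual_integrity(bilingual, source_normalized, target_normalized):
    """Report bilingual entries whose words are not normalized forms.

    bilingual: dict mapping a source word to a list of target words.
    source_normalized / target_normalized: sets of normalized forms that the
    monolingual dictionaries can produce (assumed to be available as sets).
    """
    problems = []
    for source, targets in bilingual.items():
        if source not in source_normalized:
            problems.append(("unknown source", source))
        for target in targets:
            if target not in target_normalized:
                problems.append(("unknown target", source, target))
    return problems
```

Each reported tuple points a terminologist at one entry to correct, which matches the manual control step described later in this section.
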
A year ago we also started a campaign to increase the vocabulary coverage of 
the dictionaries in our scope, which ranges over a number of fields, from nuclear 
technology to astrophysics through medicine, chemistry, and semiconductors. 

Two main sources have been exploited: 

the vocabulary of our databases 

the vocabulary from the log files resulting from queries to our Intranet and 
Internet applications. 



The vocabulary has been extracted and compared to the monolingual dictionaries. 
Normalized vocabularies from the database indexes have been compared to the 
bilingual dictionaries. 

Compounds have been extracted automatically to study the problem of idiomatic 
expression coding and to suggest expressions to be translated as a whole. The 
following example shows compounds extracted from the French and English parts of the 
database of publications issued by our scientists: 



Occurrence  French                      Occurrence  English 
number                                  number 
140         résultat obtenus            351         cross section 
138         temps réel                  -           experimental result 
125         champ magnétique            -           magnetic field 
122         étude expérimentale         218         heavy ion 
122         acier inoxydable            -           experimental datum 
114         simulation numérique        -           stainless steel 
113         résultat expérimental       -           good agreement 
91          microscopie électronique    -           experimental study 
83          -                           122         neutron diffraction 
80          -                           119         magnetic structure 
80          -                           -           magnetic property 
77          -                           -           swift ion 
77          produit de fission          -           room temperature 
74          -                           -           high temperature 
74          -                           -           fission product 
68          -                           -           nuclear plant 
68          joint de grains             87          numerical simulation 

(In the original figure, lines connect each French compound to its English 
translation; entries that are not legible in the source are shown as "-".) 



It can easily be seen that, if the distribution of themes is equivalent between documents 
in French and in English, words that are translations of one another have comparable 
ranks in the lists sorted by frequency. 

From the preceding list, we can infer that "cross section" must be translated as a whole 
by "section efficace", because "efficace" is not a translation of "cross". 
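
The reasoning above can be sketched in Python. This is a simplified stand-in: it counts adjacent word pairs rather than compounds found by syntactic parsing, and both function names and the toy dictionary in the usage note are hypothetical.

```python
from collections import Counter


def count_adjacent_pairs(documents):
    """Count adjacent word pairs over tokenized, normalized documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(zip(tokens, tokens[1:]))
    return counts


def untranslatable_word_for_word(compound, translation, bilingual):
    """True if some source word has no dictionary translation in the target compound."""
    target_words = set(translation.split())
    return any(
        not target_words.intersection(bilingual.get(word, []))
        for word in compound.split()
    )
```

For instance, with a toy dictionary in which "cross" maps to "croix" and "section" maps to "section", the pair "cross section" / "section efficace" is flagged, because no listed translation of "cross" appears in the French compound.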

Another source is the INIS (International Nuclear Information System) Thesaurus. 
This thesaurus has been used to extract synonyms for monolingual reformulation as 
well as English-French keyword correspondences for bilingual dictionaries. 

The following problems have been encountered: 

Words are in upper-case characters, without accents for French. It was necessary 
to add accents and rewrite the words in mixed upper and lower case. 

A number of keywords are in the plural form, which is not compatible with the 
normalized forms of SPIRIT. For example, "JEUNES" is used for "jeunes" = 
teenagers and "JEUNE" is used for "jeûne" = diet. 

Some relations are of no interest for SPIRIT because they are automatically taken 
into account by linguistic processing. 

Examples: 

the relation between a broader term and a narrower term when it is a relation between 
a noun phrase and its head (example: “filter” and “air filter” or “electric filter”), or 
the relation between synonyms that are different representations of the same word 
(example: “ashing (dry)” is equivalent to “dry ashing”). 
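
A rough Python sketch of how such redundant thesaurus relations might be filtered out is given below. It assumes English terms whose head is the last word, which is a simplification, and both function names are hypothetical.

```python
def is_redundant_relation(broader, narrower):
    """True when the narrower term is a noun phrase whose head is the broader term.

    Assumes English terms whose head is the last word, a simplification.
    """
    words = narrower.split()
    return len(words) > 1 and words[-1] == broader


def canonical_form(term):
    """Rewrite an inverted entry such as 'ashing (dry)' as 'dry ashing'."""
    if term.endswith(")") and " (" in term:
        head, qualifier = term[:-1].split(" (", 1)
        return f"{qualifier} {head}"
    return term
```

Two synonyms whose canonical forms are equal can then be dropped, since the linguistic processing already treats them as the same expression.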

Concerning compounds, English-French translations may not be true translations but 
mere explanations of the English terms, so this kind of information cannot be used to 
build reformulation rules. 

Examples: 

Carcinogen -> produit chimique cancérogène 
Chemosterilant -> agent chimique stérilisant 
College -> établissement d'enseignement supérieur 
Forestry -> valorisation des produits forestiers 

All the translations or synonyms produced by the above processing are manually 
checked by our terminologist before being added to the existing dictionaries. 

The most difficult problem is to discover compounds that need to be translated as a 
whole. To tackle this problem, we started a research program a year ago 
to produce an automatic extractor of compound translations from existing translated 
texts. This system establishes a correspondence between the sentences of both texts. It 
starts from a sentence and its translation, relies on their syntactic analysis 
and on the links established through the existing dictionary, and proposes links between 
words or compounds that had not yet been related. 

At this time, the sentence alignment program has been developed and tested as part of 
the ARCADE international common evaluation program [10]. Because we have 
favored precision over recall, our results are not the best, but our approach, 
based on the use of crosslingual interrogation for alignment, shows the lowest standard 
deviation. Our approach therefore seems to be more robust, giving the smallest 
difference between the best results and the worst ones. 

We are developing word alignment at the present time. 



7 Preliminary Evaluation 

Comparative evaluation of effectiveness using a common database and set of queries 
is a very large undertaking. That is why it is generally a common investment of 
the world research community, as for TREC (organized by NIST, USA) or 
AMARYLLIS (a TREC-like evaluation on French databases run by the French 
Speaking Countries University Organisation (AUF)). 

A quantitative comparison of effectiveness between the current application, which 
uses monolingual indexing in French for both the English and French parts of the 
information (ML), and the bilingual indexing and retrieval proposed in this 
paper (BL) would be of great interest. 

Unfortunately, the amount of work needed to obtain a statistically significant result 
was too great. That is why we give only some qualitative results. 

Because the ML application is in production, we have a log file of real queries from 
users. We have begun to systematically execute these queries and their translations 
into the other language using the ML application and the BL application. 

The first results show different kinds of situations: 

The system gives good translations of the query. In this case, and provided that 
there are few bilingual documents, the result of the BL system is much better 
than that of the ML one. 

For example, the query "food preservation by irradiation" gives three relevant 
documents in the first three positions in BL. ML gives only one relevant document. 

The same query in French, "conservation des aliments par irradiation", gives three 
relevant documents in the first three positions in BL. In ML, two relevant documents 
are in the first two positions, and the third relevant document is in 8th position. 

The difference in effectiveness is strongly influenced by the proportion of relevant 
documents that have parts in both languages. 







The query uses an expression that cannot be translated word for word, and this 
expression is not in the bilingual reformulation dictionary. The results of ML and 
BL are similar. 

For example, a query on "courant de Foucault" in French gives the same answer in 
ML and BL (14 documents) because the translation is not known by the system. The 
result for "eddy current" is better in BL (18 documents) than in ML (12 documents) 
only because of the lack of normalization in English, which is not the indexing 
language for ML. 

After the introduction of the reformulation "courant de Foucault" <-> "eddy current" in 
BL, 22 relevant documents are retrieved at the top of the list with both the French 
and the English query. 

In conclusion, the gain in effectiveness varies but is always positive. With the 
work done on updating the bilingual dictionaries, especially for compounds, 
effectiveness is expected to keep improving. 



8 Conclusion 

Even if the system is still not perfect at this time, the results are much better than 
those obtained when indexing documents in only one language. Conditions of use are 
also greatly improved by asking one query and getting one list of ranked 
documents, as opposed to querying in both languages successively and manually 
merging the ranked results. 

The interrogation of the bilingual catalogs and of the publications of the C.E.A.’s 
scientists is undergoing quality testing. We hope to be able to offer our users access 
to this Intranet application soon. 

Of course, a great deal remains to be done to improve this technology. Part of this 
improvement has been tested in prototypes but has not yet been integrated into the 
commercial product. This involves the introduction of noncontiguous idiomatic 
expressions, multistep reformulation, and full recognition of dependency relations. 

Further improvements require sustained efforts to enhance the quality, consistency, and 
exhaustiveness not only of the monolingual dictionaries used for query and document 
parsing but also of the monolingual and bilingual rules required for the reformulation 
process. 



Abstract. We propose a query expansion technique which is based on a 
statistical similarity measure among terms to improve the effectiveness of the 
dictionary-based cross-language information retrieval (CLIR) method. We 
employ a term similarity-based sense disambiguation technique proposed in our 
earlier work to enhance the accuracy of the dictionary-based query translation 
method. The query expansion technique is then applied to the translation of 
queries to further improve their retrieval performance. We demonstrate the 
effectiveness of the two techniques combined using queries in three languages, 
namely, German, Spanish, and Indonesian, to retrieve English documents from a 
standard TREC (Text Retrieval Conference) collection. The results of our 
experiments indicate that the term similarity-based techniques work better when 
there are more phrases in the queries. In addition, our results also re-emphasize 
other researchers’ finding that phrase recognition and translation are critical to 
CLIR’s effectiveness. 



1 Introduction 

The increasing availability of digital documents from all around the world accessible 
through the Internet has allowed Internet users to access text materials written in 
foreign languages. Today, there are many Internet search engines that make searching 
for documents easier using a query containing a set of keywords. Most of these search 
engines can locate documents in various languages as long as the query is written in 
the same language as the target documents, referred to as monolingual searching, or the 
documents contain at least one of the keywords, e.g., proper names. It is a challenge 
when the sought-after documents are written in a language that is foreign to the user, 
and none of the query words are contained in the documents. This challenge has 
contributed to the increasing popularity of Cross-Language Information Retrieval 
(CLIR) among researchers in the Information Retrieval (IR) community in recent 
years. 

Research in CLIR explores techniques for retrieving documents in one language in 
response to queries in a different language. The most obvious approach to CLIR is 
either to translate the queries into the language of the target documents or to translate 
the documents into the language of the queries. Translating documents is obviously a 
very expensive task, since a collection typically contains many documents, 
each of which may be very long, and, above all, there is not yet a reliable natural 
language translation system that can be used to automate the process. For these 
reasons, many researchers in this field, the authors included, have opted for the query 
translation approach. 

There are a number of ways to perform query translation, namely, by 1) employing 
machine translation techniques [12], 2) using parallel corpora [19, 20], or 3) using 
bilingual dictionaries [1, 3, 7, 10]. The first two approaches are very labour intensive. 
Manual hand-coding of linguistic, semantic and pragmatic knowledge is required for a 
machine translation engine to produce good translations. This can be quite 
overwhelming when the domain of coverage is wide. Likewise, a great deal of work is 
also required for building parallel collections, i.e., for manually translating each of the 
documents in the collection into its equivalent in another language. 

The third approach, dictionary based translation, has recently become very practical 
with the increasing availability of machine-readable bilingual dictionaries. In addition, 
the topic coverage of this technique is less limited than that of a parallel corpus since a 
dictionary typically contains a wider variety of terms than a sample corpus. However, 
the effectiveness of this approach depends on its ability to select the right sense of a 
word from many possible senses provided by the dictionary. In our earlier work [2], we 
proposed a technique for selecting the best sense of a query term from all possible 
senses given by a dictionary based on statistical term similarity among the senses. In 
this work, we employ the same technique but enhanced with a query expansion 
technique which is also based on the term similarity measure. Unlike the query 
expansion technique used in our earlier work, the current technique selects terms, to be 
added to the queries, based on the collective similarity between each candidate term 
and all of the existing terms in the query. 

We hypothesize that there are differences in the pattern of term distribution among 
different languages, which may correlate with the effectiveness of our techniques in 
those languages. For this reason, we have conducted a series of experiments using a 
query set transcribed in three languages, namely, German, Spanish, and Indonesian to 
retrieve documents in English. We then analyzed the retrieval effectiveness of our 
techniques for each of the languages and investigated the cause of variations. 

In Section 2, we present a brief survey of relevant work done by other researchers. 
Section 3 describes our term similarity based query expansion technique, and provides 
a review of our sense disambiguation technique. Section 4 discusses the experiments 
that we conducted to measure the effectiveness of our techniques and their results. 
Finally, Section 5 concludes this paper with a summary. 



2 Related Work 

Some of the earliest work in CLIR was done by Salton [17] and Pevzner [13] who used 
thesauri to index and retrieve documents written in multiple languages. Their results 
showed that the effectiveness of cross-language retrieval was almost the same as that 
of monolingual retrieval. Since then, research in CLIR has grown to cover a wider 
variety of languages and techniques. 

A technique for translating queries indirectly using parallel corpora has been 
proposed by Sheridan & Ballerini [19, 20]. A parallel corpus is a collection of the same 
or equivalent set of documents written in two (or more) languages. These authors 
employed a term similarity thesaurus built by comparing the occurrences of terms in 
each pair of equivalent documents written in Italian and German. Their German 
queries were substituted with Italian queries constructed from the terms found in the 
Italian documents parallel to the relevant German documents [19]. 

Query translation using bilingual dictionaries has been much studied by researchers 
in the field [1, 3, 6, 7, 10]. As mentioned in the previous section, dictionary based 
translation is prone to errors due to the high possibility of selecting the wrong sense 
(meaning) of a term among the senses provided by the dictionary. As a result, queries 
translated using this method typically perform worse than the equivalent monolingual 
queries - referred to here as monolingual retrieval performance. This is called the 
ambiguity problem in CLIR. Another problem associated with the dictionary-based 
method is the problem in translating compound-noun phrases in a query. The problem 
stems from separately translating words belonging to such a phrase, word-for-word, 
resulting in a sequence of words with incompatible meanings. 

One technique that can be used to alleviate the impact of the above problems is to 
identify phrases in the query and translate them using a phrase dictionary. Such a 
technique has been shown to improve CLIR performance. Hull & Grefenstette [10] 
demonstrated that the retrieval performance of queries produced using manual phrase 
translation was significantly better than that of queries produced by simple (word-for- 
word) dictionary-based translation. Davis & Ogden [8] showed that phrase translation 
using a phrase dictionary built by extracting phrases from parallel sentences in French 
and English documents improved the performance of their dictionary-based CLIR. 
Work in term-sense disambiguation has been done by Ballesteros & Croft [4] who 
demonstrated the effectiveness of a phrase recognition and translation technique which 
is based on the co-occurrence of terms in the target document collection. A different 
approach is proposed by Pirkola [14] whose technique reduces the effect of the 
ambiguity problem by structuring the queries and translating them using a general and 
a domain-specific dictionary. 

To further mitigate the negative effect of mistranslated query terms, many 
researchers have employed query expansion techniques. Query expansion is a well- 
known method in IR for improving retrieval performance. Basically, it adds new 
terms, selected using a certain technique, to the query: the query becomes 
more precise when the added terms clarify the meaning of the original query terms, 
and its recall improves as terms associated with the original query terms are added. 

In monolingual IR, Sparck Jones [21] proposed a query expansion technique which 
adds terms obtained from term clusters built based on co-occurrences of terms in the 
document collection. Qiu & Frei [15] have done similar work, but using term 
similarity instead of term co-occurrence. In CLIR, Ballesteros & Croft [4] and 
Carbonell et al. [6] employed pseudo relevance feedback techniques to obtain terms for 
the query expansion. The pseudo relevance feedback techniques assume that the top- 
ranked documents initially retrieved using the queries are relevant. Terms appearing in 
these documents are then added to the queries. Ballesteros & Croft [3] 
proposed pre-translation, post-translation and a combination of post and pre-translation 
query expansion techniques based on term co-occurrence. They found that post- 
translation query expansion, i.e., query expansion on the translated queries, and the 
combination-translation query expansion, i.e., query expansion on both the original and 
the translated queries, are effective in improving CLIR performance. 
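
To make the pseudo relevance feedback idea concrete, here is a minimal Python sketch. It is not the method of any cited work: the term selection by raw frequency, the function name, and the `top_docs` and `new_terms` cut-offs are all illustrative assumptions.

```python
from collections import Counter


def pseudo_relevance_expand(query_terms, ranked_docs, top_docs=2, new_terms=3):
    """Add the most frequent unseen terms from the top-ranked documents.

    ranked_docs: list of token lists, best-ranked first. The top_docs and
    new_terms cut-offs are illustrative values, not taken from the cited work.
    """
    counts = Counter()
    for tokens in ranked_docs[:top_docs]:
        counts.update(t for t in tokens if t not in query_terms)
    expansion = [term for term, _ in counts.most_common(new_terms)]
    return list(query_terms) + expansion
```

The sketch treats the first `top_docs` results as relevant without checking them, which is exactly the assumption that gives the technique its name.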



3 Algorithms 

3.1 Term Disambiguation 

In our earlier work [2] we proposed a term-sense disambiguation technique for 
selecting the best translation sense of a word from all possible senses given by a 
bilingual dictionary. Basically, given a set of original query terms, we select for each 
of the terms the best sense such that the resulting set of selected senses contains senses 
that are closely related to, or statistically similar to, one another. This is done using an 
approximate or pseudo-optimal algorithm. Given a set of n original query terms 
$\{t_1, \ldots, t_n\}$, a set of translation terms, T, is obtained using the following algorithm: 



1. For each $t_i$ ($i = 1$ to $n$), retrieve a set of senses $S_i$ from the dictionary. 

2. For each set $S_i$ ($i = 1$ to $n$), do steps 2.1, 2.2 and 2.3. 

2.1 For each sense $t'_j$ ($j = 1$ to $|S_i|$) in $S_i$, do step 2.1.1. 

2.1.1 For each set $S_k$ ($k = 1$ to $n$ and $k \neq i$), get the maximum similarity, $M_{j,k}$, 
between $t'_j$ and the senses in $S_k$. 

2.2 Compute the score of sense $t'_j$ as the sum of $M_{j,k}$ ($k = 1$ to $n$ and $k \neq i$). 

2.3 Select the sense in $S_i$ with the highest score, and add the selected sense into the 
set T. 

3. End 



Query terms that are not found in the dictionary are included in the translation set T 
as-is. This is typically the case for proper names, technical terms, and acronyms. 
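
The steps above can be sketched in Python as follows. This is only an illustrative reading of the algorithm: the function name, the dictionary layout, and the assumption that every dictionary entry lists at least one sense are ours, and the similarity function is supplied by the caller.

```python
def disambiguate(query_terms, dictionary, sim):
    """Pick one translation sense per query term.

    dictionary: maps a source term to its non-empty list of candidate senses.
    sim: function (term_a, term_b) -> similarity score; assumed symmetric.
    Terms missing from the dictionary are kept as-is.
    """
    sense_sets = [dictionary.get(term, [term]) for term in query_terms]
    selected = []
    for i, senses in enumerate(sense_sets):
        best_sense, best_score = None, float("-inf")
        for sense in senses:
            # Sum, over the other terms' sense sets, the maximum similarity.
            score = sum(
                max(sim(sense, other) for other in other_senses)
                for k, other_senses in enumerate(sense_sets)
                if k != i
            )
            if score > best_score:
                best_sense, best_score = sense, score
        selected.append(best_sense)
    return selected
```

Each term is resolved independently against the full sense sets of the other terms, which is why the procedure is approximate rather than a search over all sense combinations.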

We obtain the degree of similarity or association-relation between terms using a 
term association measure, called the Dice similarity coefficient [16], which is commonly 
used in document or term clustering. The term association measure, $sim_{xy}$, between 
terms x and y is computed as follows: 

$$sim_{xy} = \frac{2 \sum_{i=1}^{n} (w'_{xi} \cdot w'_{yi})}{\sum_{i=1}^{n} w_{xi}^{2} + \sum_{i=1}^{n} w_{yi}^{2}}$$

where 

$w_{xi}$ = the weight of term x in document i 
$w_{yi}$ = the weight of term y in document i 
$w'_{xi}$ = $w_{xi}$ if term y also occurs in document i, or 0 otherwise 
$w'_{yi}$ = $w_{yi}$ if term x also occurs in document i, or 0 otherwise 
n = the number of documents in the collection. 

The weight of term x in document i is computed using the standard tf*idf term 
weighting formula [18]. 
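
A small Python sketch of this measure is given below. It assumes each term is represented as a mapping from document identifiers to weights, with absent documents simply missing; that representation and the function name are illustrative choices, and the tf*idf weights are expected to be computed elsewhere.

```python
def dice_similarity(weights_x, weights_y):
    """Dice coefficient between two terms.

    weights_x, weights_y: dicts mapping a document id to the term's weight in
    that document (documents where the term is absent are simply missing).
    """
    shared = weights_x.keys() & weights_y.keys()
    numerator = 2.0 * sum(weights_x[d] * weights_y[d] for d in shared)
    denominator = sum(w * w for w in weights_x.values()) + sum(
        w * w for w in weights_y.values()
    )
    return numerator / denominator if denominator else 0.0
```

Restricting the numerator to shared documents plays the role of the primed weights, which are zero wherever the other term does not occur.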



3.2 Query Expansion 

As described in Section 2, query expansion is known to be effective 
in improving the retrieval performance of translated queries. In this work, we pursue 
the same direction but use a query expansion technique different from those used by 
previous researchers. Our query expansion technique adds to a given query terms 
that are highly similar, in terms of statistical distribution, to all of the terms in the 
query. We measure this quality of a term, referred to as the term proximity score, by 
combining the similarity scores between the term and each term in the 
query. First, we compute the term proximity score of every term in the collection with 
respect to the query. Given a query q containing m terms, the proximity 
score, $p_{xq}$, of term x with respect to query q is computed as follows: 

$$p_{xq} = idf_x \cdot \frac{\sum_{j=1}^{m} (sim_{xj} \cdot idf_j)}{\sum_{j=1}^{m} idf_j}$$

where 

$sim_{xj}$ = the similarity measure between term x and query term j 
$idf_x$ = the inverse document frequency (idf) of term x 
$idf_j$ = the idf of query term j 
m = the number of terms in query q. 

$idf_x$ is computed as $\log(n / df_x)$, where $df_x$ is the number of documents containing 
term x in the collection and n is the number of documents in the collection. Next, 
terms whose proximity scores are above a threshold 
value are selected and added to the query as expansion terms. The optimal value 
of the threshold is collection dependent, and is established through experiments. 
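
As a hedged illustration, the proximity score and the threshold-based selection might be coded as follows in Python. The function names, the dictionary of precomputed idf values, and the use of the natural logarithm are assumptions made for the sketch; the text does not state the base of the logarithm.

```python
import math


def idf(n_docs, doc_freq):
    """Inverse document frequency, log(n / df); natural log is an assumption."""
    return math.log(n_docs / doc_freq)


def proximity_score(term, query_terms, sim, idf_of):
    """Idf-weighted average similarity of `term` to the query, scaled by its own idf."""
    total_idf = sum(idf_of[q] for q in query_terms)
    weighted = sum(sim(term, q) * idf_of[q] for q in query_terms)
    return idf_of[term] * weighted / total_idf


def expand_query(query_terms, candidates, sim, idf_of, threshold):
    """Append candidates whose proximity score exceeds the threshold."""
    added = [
        c for c in candidates
        if c not in query_terms
        and proximity_score(c, query_terms, sim, idf_of) > threshold
    ]
    return list(query_terms) + added
```

Because every query term contributes to the score, a candidate that is close to only one query term is less likely to pass the threshold than one that is close to all of them.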

In practice, only terms whose similarity scores, $sim_{xj}$, are non-zero are included in 
the above computation. To minimize redundant computations in both the sense 
disambiguation and query expansion techniques, we use a term-similarity matrix as a 
lookup table. This matrix is built using the target corpus. Making use of the target 
corpus, instead of a separate training collection, for this purpose is very practical since 
statistical parameters, such as term frequency (tf) and df, are easily obtainable from the 
text retrieval system used to index the documents. 



4 Experiments 

To measure the effectiveness of the sense disambiguation and query expansion 
techniques we conducted a series of experiments using an English corpus. The corpus 
contains 748 MB of the TREC (Text Retrieval Conference [8]) Associated Press (AP) 
collections. We used the German and Spanish versions of the TREC’s 24 queries for 
cross-language topics to retrieve relevant documents in the English corpus. In addition, 
we also used an Indonesian version of the queries which were constructed by manually 
translating the queries. The translation was done by one of the authors whose native 
language is Indonesian. The example below shows the different language versions of 
one of the queries. 



English: Reasons for controversy surrounding Waldheim's World War II actions. 
German: Grund der Polemik um die Tätigkeit Kurt Waldheims im Zweiten Weltkrieg. 
Spanish: Razones de la controversia que rodea las acciones de Waldheim durante la 
Segunda Guerra Mundial. 
Indonesian: Alasan munculnya kontroversi tentang tindakan-tindakan Waldheim pada 
Perang Dunia II. 



Next, we used a subset of the TREC AP collections, i.e. the AP89 collection, to 
build an English term-similarity matrix. We used only a subset of the collection 
because computing the term-similarity matrix for the entire corpus is too costly in 
terms of computation time and memory space requirement. 

In the experiments, we used three bilingual dictionaries, namely, the Echols 
Indonesian-English dictionary, the Collins Spanish-English dictionary, and the 
Langenscheidt German-English dictionary to translate the Indonesian, the Spanish, and 
the German queries, respectively, into English. The query translation process was 
performed manually. First, phrases or sequences of words in the queries that could be 
found in the dictionary were replaced with their translations according to the 
dictionary. Next, each of the remaining query terms was substituted with all of the 
translation senses found in the dictionary for that term. Query terms that were not 
found in the dictionary were left unchanged. We then applied the sense 
disambiguation technique to select the best sense among the possible senses for 
each query term. Finally, we applied the query expansion technique to the resulting 
queries. 
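
Although the translation was carried out by hand, the first steps of the procedure could be automated along the lines of the following Python sketch. The function name, the separate phrase and word dictionaries, the longest-match strategy, and the `max_phrase_len` limit are all assumptions made for illustration.

```python
def translate_query(words, phrase_dict, word_dict, max_phrase_len=3):
    """Return one list of candidate senses per translated unit.

    Phrases found in phrase_dict are translated as a whole (longest match
    first); remaining words receive every sense listed in word_dict; words
    found in neither dictionary are kept unchanged.
    """
    units = []
    i = 0
    while i < len(words):
        for length in range(min(max_phrase_len, len(words) - i), 1, -1):
            phrase = " ".join(words[i:i + length])
            if phrase in phrase_dict:
                units.append([phrase_dict[phrase]])
                i += length
                break
        else:
            # No phrase starts here: fall back to word-level lookup.
            units.append(word_dict.get(words[i], [words[i]]))
            i += 1
    return units
```

The resulting lists of senses correspond to the input of the sense disambiguation step; a phrase translated as a whole contributes a single unambiguous sense.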

We measured the effectiveness of our techniques in terms of average retrieval 
precision, which was computed using the standard 11 recall-point measurement for 
TREC. Our CLIR method uses an off-the-shelf IR system for indexing and retrieving 
the documents. In this work we use the INQUERY system [4]. In the experiments, we 
compared the retrieval effectiveness of four methods, namely, (1) the monolingual 
method, which uses the original English TREC queries, (2) the simple translation method, 
which uses queries produced by including all possible translation senses in the 
dictionary for each query term, (3) the automatic disambiguation method, which employs 
the sense disambiguation technique, and (4) the expanded method, which employs the 
query expansion technique. 
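
For readers unfamiliar with the 11 recall-point measurement, the Python sketch below shows the usual interpolated definition for a single query. Whether the evaluation tool used in the experiments follows exactly this variant is an assumption, and the function name is ours.

```python
def eleven_point_average_precision(ranked_ids, relevant_ids):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    points = []  # (recall, precision) after each relevant document retrieved
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    total = 0.0
    for step in range(11):
        level = step / 10
        # Interpolated precision: best precision at any recall >= level.
        candidates = [p for r, p in points if r >= level]
        total += max(candidates, default=0.0)
    return total / 11
```

The average retrieval precision of a query set is then the mean of this value over all queries in the set.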



4.1 Result 

As can be expected, the average retrieval precision of the queries produced using the 
simple translation method was much worse, for all three languages, than that of the 
equivalent monolingual method. As shown in Table 1, the average retrieval precision 
of the English translation queries (queries translated into English) of the German, the 
Spanish, and the Indonesian queries is, respectively, 65.77%, 78.47%, and 57.41% 
below that of the monolingual method. The increase in the number of query 
terms resulting from the simple translation method contributed to the poor 
performance. Each of the original English TREC queries contains, on average, 
5.46 terms (excluding stop-words). The average English query obtained 
from translating the German, the Spanish, and the Indonesian queries contains 35.21 
terms, 37.25 terms, and 17.75 terms, respectively. The average numbers of terms per 
query in the German, the Spanish, and the Indonesian queries are 5.67 terms, 6.33 
terms, and 5.79 terms, respectively. Applying the sense disambiguation technique 
reduced the performance drop relative to the monolingual method by 48.20, 40.68, and 
27.96 percentage points for the German, the Spanish, and the Indonesian queries, 
respectively. 
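
The arithmetic behind these figures can be checked with a few lines of Python, using the German precision values reported in Table 1. The helper name is hypothetical, and the recomputed percentages differ slightly from the printed ones, presumably because the printed precision values are rounded.

```python
def relative_change(precision, baseline):
    """Signed change of `precision` relative to `baseline`, in percent."""
    return (precision - baseline) / baseline * 100.0


MONOLINGUAL = 0.2804
german_simple = relative_change(0.0960, MONOLINGUAL)         # about -65.8
german_disambiguated = relative_change(0.2312, MONOLINGUAL)  # about -17.5
gain_in_points = german_disambiguated - german_simple        # about 48.2
```

The improvement attributed to disambiguation is thus the difference between two drops measured against the same monolingual baseline.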

We then applied the query expansion technique to improve the retrieval 
effectiveness of translation queries. For each of the query sets, we first performed a 
series of preliminary experiments to establish the best threshold values that maximize 
the queries’ retrieval effectiveness. 

Table 1. Average retrieval precision (and performance drop compared to 
monolingual) of the English queries translated from the German, the Spanish, 
and the Indonesian queries for the simple translation and the automatic 
disambiguation methods. 

Language               Baseline  Simple translation  Automatic disambiguation 
Monolingual (English)  0.2804    -                   - 
German to English      -         0.0960 (-65.77%)    0.2312 (-17.57%) 
Spanish to English     -         0.0604 (-78.47%)    0.1745 (-37.79%) 
Indonesian to English  -         0.1194 (-57.41%)    0.1978 (-29.45%) 



Applying the query expansion technique to the queries produced using the simple 
translation method resulted in significant performance improvements. As can be seen 
in Table 2, the improvements were 2.58, 12.14, and 15.43 percentage points for the 
German, the Spanish, and the Indonesian query sets, respectively. 



Table 2. Average retrieval precision (and performance drop compared to 
monolingual) of the English queries translated from the German, the Spanish, 
and the Indonesian queries for the simple translation and the expanded simple 
translation methods. 

Language               Baseline  Threshold  Simple translation  Expanded simple translation 
Monolingual (English)  0.2804    -          -                   - 
German to English      -         0.7        0.0960 (-65.77%)    0.1032 (-63.19%) 
Spanish to English     -         0.3        0.0604 (-78.47%)    0.0944 (-66.33%) 
Indonesian to English  -         1.0        0.1194 (-57.41%)    0.1627 (-41.98%) 



Applying the query expansion technique in combination with the sense 
disambiguation technique resulted in slight performance improvements. As can be seen 
in Table 3, the improvements were 0.28, 4.16, and 5.81 percentage points for the 
German, the Spanish, and the Indonesian query sets, respectively. 



Table 3. Average retrieval precision (and performance drop compared to 
monolingual) of the English queries translated from the German, the Spanish, 
and the Indonesian queries for the automatic disambiguation and the expanded 
automatic disambiguation methods. 

Language               Baseline  Threshold  Automatic disambiguation  Expanded automatic disambiguation 
Monolingual (English)  0.2804    -          -                         - 
German to English      -         1.0        0.2312 (-17.57%)          0.2319 (-17.29%) 
Spanish to English     -         0.9        0.1745 (-37.79%)          0.1861 (-33.63%) 
Indonesian to English  -         2.0        0.1978 (-29.45%)          0.2141 (-23.64%) 



4.2 Analyses 

The results of our experiments demonstrate that the term-similarity based sense 
disambiguation does improve the retrieval performance of dictionary-based CLIR. 
However, the degrees of improvement are not the same for all the query 
sets. The best result was obtained for the German queries (48.20%), followed by the 
Spanish queries (40.68%), and, lastly, the Indonesian queries (27.96%). Note that we 
compare the degrees of improvement instead of the average retrieval precision values. 
Since we used a different bilingual dictionary for each language's query set, we cannot 
compare the average retrieval precision of one query set with another. 

Our further investigation revealed that significant retrieval performance
improvements were achieved by queries containing phrases which could be translated
unambiguously into English using the bilingual dictionaries. Table 4 shows the
number of phrases in the original queries for each query set, and the number of such
phrases that can be translated using the bilingual dictionaries.



Table 4. The number of phrases in the original query sets and the number
of phrases that can be translated using the dictionaries.

                      German queries   Spanish queries   Indonesian queries
Original phrases      36               36                38
Translation phrases   10               10                5



From these observations, it can be said that a larger performance improvement using
the sense-disambiguation technique can be achieved if there are more cases where
words in the original queries’ language (German, Spanish, or Indonesian) can be
translated deterministically into the target language (English) using the bilingual
dictionaries. It is also worth mentioning that, in our earlier work [2], we discovered
that, within a language’s query set, there is a correlation between the number of
phrases in a query and the percentage of retrieval improvement resulting from applying
the technique.

As for the query expansion technique, we observed that significant performance
improvements were obtained when the technique discovered terms mutually associated
with the existing query terms. Unfortunately, in many cases, the technique also added
non-relevant terms to the queries, which resulted in retrieval performance drops. Such
performance drops were significant in the case of queries with a relatively small number
of terms, in particular, queries which had gone through the sense disambiguation
process.

We also observed that there is a correlation between the number of phrases in the
original queries and the technique’s effectiveness in improving retrieval performance.
For every query in each of the query sets, we compared the number of phrases in the
original query and the performance improvement resulting from using the query
expansion technique. Using Spearman’s correlation coefficient method [11], a
non-parametric statistical tool, we obtained correlation coefficients of 0.366, 0.473, and
0.238 for the German, the Spanish, and the Indonesian queries, respectively. The
correlation between the number of phrases and the performance improvement is
significant for the German and the Spanish query sets, as their coefficients are well
above the critical value for 24 observations (queries) at α = 0.05, namely, 0.343.
Note that for the German queries, we count each compound word as a phrase since it 
translates into a noun phrase in English. Our tentative explanation for this phenomenon
is that, as members of a phrase tend to co-occur, so do their translations, and so the
query expansion technique has a better chance of finding the phrase members’ correct 
translations that were not identified by the translation process. 
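
The rank correlation test used above can be sketched in a few lines of Python. This is only an illustration: the per-query phrase counts and improvement values are made-up placeholders, not the data of this study, and the function names are hypothetical. Tied values receive average ranks, and the coefficient is computed as the Pearson correlation of the two rank lists.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical data: phrases per query vs. change in average precision.
phrases = [0, 1, 1, 2, 3, 4]
improvement = [-0.02, 0.01, 0.00, 0.03, 0.02, 0.08]
print(f"rho = {spearman(phrases, improvement):.3f}")
```

The resulting coefficient would then be compared against the critical value for the given number of queries, as done in the text.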

Overall, our findings reemphasize the point that has been made by other researchers
such as Ballesteros & Croft [4], i.e. that phrase recognition and translation are critical
to the effectiveness of cross-language query translation.



5 Summary 

The availability of machine-readable bilingual dictionaries has made dictionary-based
approaches to CLIR cost effective. The results of our study suggest that the two major
research issues in CLIR, namely, term ambiguity and phrase recognition and
translation [3, 4, 10], are also the main sources of problems in dictionary-based query
translation techniques. We have demonstrated that using statistical term similarity
measures to enhance the dictionary-based query-translation CLIR method, particularly
in term disambiguation and query expansion, can significantly improve retrieval
effectiveness.

Our work with term-similarity based sense disambiguation and query expansion has
provided us with a vehicle for investigating the effect of various language-specific term
characteristics, including term distribution parameters, on the retrieval performance of
CLIR. This study is part of ongoing research aimed at building a model for
CLIR. Our short-term objective in this research is to identify the collective statistical
property or properties among terms which correlate with the term ambiguity and
phrase translation problems.

Abstract. Digital Libraries have gained tremendous interest with
several research projects addressing the wealth of challenges in this field.
While computational intelligence systems are being used for specific
tasks in this arena, the majority of projects rely on conventional
techniques for the basic structure of the library itself. With the SOMLib
project we created a digital library system that uses a neural network-based
core for the representation of the library. The self-organizing
map, a popular unsupervised neural network model, is used to topically
structure a document collection similar to the organization of real-world
libraries. Based on this core, additional modules provide information
retrieval features, integrate distributed libraries, and automatically label
the various topical sections in the document collection. A metaphor
graphics based interface further assists the user in intuitively understanding
the library by providing an instant overview.

Keywords: Self-Organizing Map (SOM), Document Clustering, Learning,
Distributed Digital Libraries, Dublin Core Metadata, Metaphor
Graphics, Visualization



1 Introduction 

In recent years we have witnessed an uninterrupted rise in the amount
of information available in electronic form. While the size and availability of
electronic information have changed a lot, ways of representing and interacting
with those collections have not kept pace. Most information repositories still
present themselves as varieties of lists of entries, ranging from filename listings
and commented lists of documents to manually created hierarchies of pieces of
information, which usually try to find one single place for every document in
the collection. Searching these collections requires users to define their queries
in some Boolean logic based expressions, specifying large numbers of keywords,
synonyms and antonyms, requiring both knowledge of the problem domain and
basic query formulation experience. Results of queries are usually presented as
long lists of (both relevant and irrelevant) retrieved documents sorted following
some ranking criteria, with the large overall number of documents retrieved
usually inhibiting efficient search. Information on the documents retrieved from
a collection is at most presented as a rather long textual description of the
available metadata.

On the other hand, taking a look at conventional libraries (which have a
long history and thus had time to evolve and adapt to our needs) and the way 
we approach and query them, we find a completely different situation: Libraries 
usually exhibit a clearly detectable structure by organizing books by topic into 
sections and shelves. This structure allows us to gain insight into the contents 
of the library as well as to get a rough overview of the amount of information 
available on specific topics. When entering a library or large-scale book store, 
in spite of the overwhelming amount of information present in such locations 
users usually manage to orient themselves and find the way to their section of 
interest quite easily. Without being able to read the title of books from the far 
distance, not knowing actually where to find a book by a specific author or even 
without knowing a title or an author of a book, most people are able to locate 
the respective sections when looking for a dictionary, a poem collection or a story 
book for children. Searching a library can take several forms: you might start 
browsing from the entrance via different floors to any specific section and shelf,
which is then searched entry by entry. Note that at most libraries you find a
map of the library at the entrance, giving an overview of which topics
may be found in which section. A second approach is to search keyword,
author and title catalogues. Third, you might also ask a librarian to help you 
find the requested pieces of information by giving a rough idea of the desired 
book. The outcome of such an inquiry is usually not only a list of titles or a pile 
of books, but also includes some recommendations based on the experience of 
the librarian. Furthermore, locating one book in the library usually leaves you, 
due to the topical structure, with several other relevant ones nearby. Once you 
find the corresponding shelf, by scanning the books sorted there, it is usually
easy for you to tell the age of a book, the number of times it has been used before 
(at least in a public library rather than in a bookstore), as well as the amount 
and type of information to be expected in the books simply by looking at them. 
The cover of the book, the title, type of binding, the shape of the binding (brand 
new versus well-thumbed and almost torn apart), the size of the book, color and 
other properties of an item on the shelf contain a wealth of information that
most people are accustomed to and able to interpret intuitively. Thus, it is easy 
for us to gain an overview of the contents of a library, the type of information 
present, how many items of a specific title can be found etc. All these features 
make orientation rather easy in spite of the wealth of information present. 

Thus, we find conventional libraries and article collections in some aspects
very well suited for the task they are intended to serve, whereas in other aspects
digital libraries undoubtedly offer more possibilities. Adopting these characteristics
of conventional libraries for electronic media, to combine the benefits of the
evolved structures of conventional systems with the benefits of digital systems,
has proven to be difficult. This is partially due to the sheer growth in the amount
of information. Reading and manually classifying all entries in an information
repository to create an order similar to the one found in conventional libraries
proves to be a Sisyphean struggle, as does searching and browsing these huge
collections.

In this paper we present the SOMLib [22] digital library system. It is based
on neural networks, especially the self-organizing map (SOM) [13], trying to 
combine the benefits of conventional library structures and procedures with the 
enhanced capabilities of digital libraries by topically clustering documents on a 
2-dimensional map similar to conventional library organization. We demonstrate 
the SOMLib library system using the classic TIME Magazine article collection.
This setting allows the evaluation of the system in the context of a real world 
document collection covering diverse topics while being generally intelligible. 
The collection is split into several subsets to model a series of consecutively re- 
leased collections. These sets of articles are treated as independent document 
archives represented by independent self-organizing maps. The SOMs are then 
integrated into a single map to model the creation of a meta-library [25]. All of 
these document archives are labeled using the LabelSOM method [26], providing 
an instant overview of the topics covered in the whole article collection in an 
organized way. Furthermore, the libViewer [24] is used to serve as a metaphor 
graphics based interface to the document collection, providing intuitive visualization
of the resources in an information repository by making metadata on the
resources instantly intelligible. 

The remainder of this paper is organized as follows. We start with a de- 
scription of the document collection forming the basis for a set of independent 
libraries in Section 2. Next, the creation of SOMLib maps providing a topically 
ordered representation of the individual libraries using the self-organizing map 
is presented in Section 3. The integration of these individual, distributed library 
maps to form a single information repository is provided in Section 4, followed by 
the presentation of the LabelSOM method for automatically labeling the various 
independent maps in Section 5. Finally, the libViewer interface using metaphor 
graphics for the visualization of the libraries is described in Section 6. We round
off the paper by giving an overview of related work in Section 7, ending with
some conclusions in Section 8.



2 The Documents - A TIME Magazine Article Collection 

For the experiments presented hereafter we use the classic TIME Magazine article
collection¹. It consists of 420 articles from TIME Magazine
dating from the early 1960s. This collection, while being small enough to be presented
in sufficient detail, provides the benefits of a real-world article collection
covering a wide range of topics from foreign affairs to high-society gossip, thus
forming an ideal testbed for the evaluation of our approach. To model a distributed
library consisting of subsequent releases of a magazine, we split the
document collection into 6 parts consisting of the documents T000 - T099, T100
- T199, ..., T500 - T599. Please note that the consecutive numbering is not
complete, i.e. not all articles are available in the package. Thus we obtain 6
sets of documents of different size, with each set containing between 53 and 87
documents.

¹ Available at http://www.ifs.tuwien.ac.at/ifs/research/ir/

Set   # Articles   Articles    Dimension   Map Size
0     63           T000-T099   1433        6x10
1     85           T100-T199   1758        7x10
2     87           T200-T299   1812        7x10
3     72           T300-T399   2019        7x9
4     60           T400-T499   1761        6x9
5     53           T500-T599   1255        6x7

Table 1. Time Magazine Data - Experiment Setup



To be used for map training, a vector-space representation of the individual documents
is created. For each document collection a list of all words appearing in
the respective collection is extracted while applying some basic word stemming
techniques. Words that do not contribute to content description are removed
from these lists. Instead of defining language or content specific stop word lists,
we discard terms that appear in more than 90% of the articles or in fewer than 3 articles in
each collection. Thus, we end up with a vector dimensionality between 1255 and
2019 for the 6 document sets, cf. Table 1. The individual documents are then
represented by feature vectors using a tf x idf, i.e. term frequency x inverse
document frequency, weighting scheme [30]. This weighting scheme assigns high
values to terms that are important for describing and discriminating between
the documents. These feature vectors are then used to train 6 self-organizing
maps consisting of between 42 and 70 units. An overview of the experimental
setup is provided in Table 1.
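
The indexing step described above can be illustrated with a minimal Python sketch. It is not the SOMLib implementation: tokenization and stemming are assumed to have happened already, the function name `build_vectors` is hypothetical, and the weight tf x log(N/df) is one common variant of the tf x idf scheme chosen here for illustration.

```python
from collections import Counter
from math import log

def build_vectors(docs, max_df_ratio=0.9, min_df=3):
    """docs: list of token lists. Returns (vocabulary, list of tf x idf vectors)."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    # Discard terms found in more than 90% of the documents or in fewer than min_df.
    vocab = sorted(t for t, d in df.items()
                   if d >= min_df and d <= max_df_ratio * n_docs)
    index = {t: j for j, t in enumerate(vocab)}
    vectors = []
    for tokens in docs:
        tf = Counter(t for t in tokens if t in index)
        vec = [0.0] * len(vocab)
        for term, count in tf.items():
            vec[index[term]] = count * log(n_docs / df[term])  # assumed idf variant
        vectors.append(vec)
    return vocab, vectors

# Toy usage with a lowered min_df so that the tiny example keeps some terms.
toy = [
    "the soviet party".split(),
    "the soviet plan".split(),
    "the congo talks".split(),
    "the congo soviet".split(),
]
vocab, vectors = build_vectors(toy, min_df=2)
print(vocab)
```

In the toy example "the" is dropped because it occurs in every document, and the terms occurring only once are dropped as too rare, which mirrors the role the frequency thresholds play in place of a stop word list.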



3 SOM and Digital Libraries 

3.1 The Self-Organizing Map

The SOMLib library is based on the self-organizing map [13] (SOM), one of the
most prominent artificial neural network models adhering to the unsupervised
learning paradigm. The model consists of a number of neural processing elements,
i.e. units. Each of the units $i$ is assigned an $n$-dimensional weight vector
$m_i$, $m_i \in \mathbb{R}^n$. It is important to note that the weight vectors have the same
dimensionality as the input patterns.

The training process of self-organizing maps may be described in terms of
input pattern presentation and weight vector adaptation. Each training iteration
$t$ starts with the random selection of one input pattern $x(t)$. This input pattern
is presented to the self-organizing map and each unit determines its activation.
Usually, the Euclidean distance between the weight vector and the input pattern
is used to calculate a unit’s activation. In this particular case, the unit with the
lowest activation is referred to as the winner, $c$, of the training iteration, as given
in Expression (1).

$$c : m_c(t) = \min_i \| x(t) - m_i(t) \| \qquad (1)$$

Finally, the weight vector of the winner as well as the weight vectors of
selected units in the vicinity of the winner are adapted. This adaptation is implemented
as a gradual reduction of the difference between corresponding components
of the input pattern and the weight vector, as shown in Expression (2).

$$m_i(t+1) = m_i(t) + \alpha(t) \cdot h_{ci}(t) \cdot [x(t) - m_i(t)] \qquad (2)$$

Geometrically speaking, the weight vectors of the adapted units are moved a
bit towards the input pattern. The amount of weight vector movement is guided
by a so-called learning rate, $\alpha$, decreasing in time. The number of units that
are affected by adaptation is determined by a so-called neighborhood function,
$h_{ci}$. This number of units also decreases in time such that towards the end of
the training process only the winner is adapted. Typically, the neighborhood
function is a unimodal function which is symmetric around the location of the
winner and monotonically decreasing with increasing distance from the winner. A
Gaussian may be used to model the neighborhood function as given in Expression
(3) with $r_i$ representing the two-dimensional vector pointing to the location of
unit $i$ within the grid, and $\| r_c - r_i \|$ denoting the distance between units $c$,
i.e. the winner of the current training iteration, and $i$ in terms of the output
space. It is common practice that at the beginning of training a wide area of
the output space is subject to adaptation. The spatial width of units affected
by adaptation is reduced gradually during the training process. Such a strategy
allows the formation of large clusters at the beginning and fine-grained input
discrimination towards the end of the training process. The spatial width of
adaptation is guided by means of the time-varying parameter $\sigma$.

$$h_{ci}(t) = \exp\left( -\frac{\| r_c - r_i \|^2}{2 \sigma^2(t)} \right) \qquad (3)$$

The movement of weight vectors has the consequence that the Euclidean
distance between input and weight vectors decreases and thus the weight vectors
become more similar to the input pattern. The respective unit is more likely
to win at future presentations of this input pattern. Adapting not only the
winner but also a number of units in the neighborhood
of the winner leads to a spatial clustering of similar input patterns in neighboring
parts of the self-organizing map. Thus, similarities between input patterns
that are present in the n-dimensional input space are mirrored within the
two-dimensional output space of the self-organizing map. The training process of
the self-organizing map describes a topology preserving mapping from a
high-dimensional input space onto a two-dimensional output space where patterns
that are similar in terms of the input space are mapped to geographically close
locations in the output space.
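
The training process of Expressions (1) to (3) can be summarized in a short, self-contained Python sketch. It is an illustration under stated assumptions rather than the SOMLib code: the linear decay of the learning rate and of the neighborhood width, the random weight initialization, and the names `train_som` and `find_winner` are choices made here, since the text only requires both parameters to decrease in time.

```python
import math
import random

def find_winner(weights, pattern):
    """Expression (1): grid position of the unit closest to the input pattern."""
    return min(weights, key=lambda pos: math.dist(weights[pos], pattern))

def train_som(inputs, rows, cols, iterations=1000, alpha0=0.5, sigma0=None, seed=0):
    """Train a SOM with cols x rows units on a list of equal-length vectors."""
    rng = random.Random(seed)
    dim = len(inputs[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    # One weight vector m_i per unit, keyed by its grid position r_i = (column, row).
    weights = {(cx, cy): [rng.random() for _ in range(dim)]
               for cx in range(cols) for cy in range(rows)}
    for t in range(iterations):
        frac = t / iterations
        alpha = alpha0 * (1.0 - frac)              # learning rate (assumed linear decay)
        sigma = max(sigma0 * (1.0 - frac), 0.01)   # neighborhood width (assumed linear decay)
        pattern = rng.choice(inputs)               # random input pattern x(t)
        winner = find_winner(weights, pattern)
        for pos, m in weights.items():
            d2 = (pos[0] - winner[0]) ** 2 + (pos[1] - winner[1]) ** 2
            h = math.exp(-d2 / (2.0 * sigma ** 2))  # Expression (3)
            for k in range(dim):                    # Expression (2)
                m[k] += alpha * h * (pattern[k] - m[k])
    return weights
```

Because the neighborhood width shrinks over time, early iterations move large regions of the map together, while late iterations adapt little more than the winner itself.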

Consider Figure 1 for a graphical representation of self-organizing maps. The
map consists of a square arrangement of 7 x 7 units, shown as circles on the
left hand side of the figure. The black circle indicates the unit that was selected
as the winner for the presentation of input pattern $x(t)$. The weight vector of
the winner, $m_c(t)$, is moved towards the input pattern and thus, $m_c(t+1)$ is
nearer to $x(t)$ than was $m_c(t)$. Similar, yet less strong, adaptation is performed
with a number of units in the vicinity of the winner. These units are marked as
shaded circles in Figure 1. The degree of shading corresponds to the strength of
adaptation. Thus, the weight vectors of units shown with a darker shading are
moved closer to $x(t)$ than units shown with a lighter shading.

Fig. 1. Architecture of a 7 x 7 self-organizing map



3.2 Text Classification 

The SOM has been used repeatedly for the unsupervised classification of free-form
text documents, cf. [12,16,17,28]. Text documents can be thought of as forming topical
clusters in the high-dimensional feature space spanned by the individual words
in the documents. A trained SOM thus represents a topical ordering of the 
documents, meaning that documents on similar topics are located close to each 
other on the 2-dimensional map. This is comparable to what one can expect 
from a conventional library, where we also find the various books ordered by 
some contents-based criteria. Thus, the SOM offers by its very architecture an 
ideal way for the organization of document repositories. 

The items to be included in the SOMLib library system are represented in the 
form of feature vectors, which are created by parsing the texts and processing 
the resulting word histograms to provide a compact representation of the texts. 
These feature vectors are used as input to train a standard self-organizing map. 
By determining the size of the map the user can decide which level of abstraction 
she desires. The larger the maps, the more units are available for representing 
the various topics in the document archive, while a smaller SOM produces a less 
detailed representation of the collection. 

Figure 2 presents the first of the 6 SOMs trained with the subsets of the
TIME Magazine articles. The units are represented by the squares in the map,
with the articles mapped onto a unit being listed in the unit area. By taking a
closer look at the individual documents mapped onto identical or neighboring
units we find that the SOM has succeeded in producing a topically ordered
representation of the articles, similar to how a human being would arrange the
articles on shelves with articles on similar topics being located close to each other.

For example, on unit (0/0)² we find article T042 entitled The View from
Lenin Hills dealing with a discussion between Nikita Khrushchev and Soviet
artists at the Lenin Hill Reception Palace, next to article T018 - Who’s in
Charge Here? about the failure of Khrushchev’s virgin land plan for agriculture
on unit (1/0) or T032 - Party Time on unit (0/1) on the New Year’s Eve party at
the Kremlin. In the opposite corner of the map, on unit (5/9), we find documents
dealing with the problems of the reintegration of Kolwezi into the Congo discussed
at a meeting between officials in article T065 - Tea and Harmony, next
to three articles on unit (4/9) (T021, T048, T058 entitled The India-Rubber
Man; Round 3; and Tshombe’s Twilight), providing more detailed information
on the background of the Congo troubles. Other groups of documents found on
this map deal, for example, with the war in Vietnam, the relation between India,
Pakistan and China etc. We leave it up to the reader to explore the other
topical sections found in this and the remaining library maps³. Obviously, the
resulting representation, while nicely organizing the documents by topic, does
not facilitate understanding the document archive by solely listing the document
numbers. This would not change a lot if we chose to use the headlines of the
articles as labels instead of document numbers. We refer the reader to Section 5
for a more intuitive representation of the topical sections based on automatically
created labels.

The clustering capabilities of the SOM, apart from providing a nicely or- 
ganized representation of a document archive, greatly facilitate interactive in- 
formation retrieval and browsing. A query is treated like a document, parsed 
to create a feature vector representation and presented to the map, retrieving 
the documents mapped onto the winning unit. Starting from this point the user 
finds similar documents on related topics on the neighboring units, allowing an 
interactive exploration of the document archive. If large numbers of documents 
are retrieved, they can be parsed to create a smaller SOM, structuring those 
documents at a finer granularity. 
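
The retrieval procedure just described can be sketched as follows. This is a hypothetical illustration, not the SOMLib interface: the function `retrieve`, the `assignments` dictionary and the `radius` parameter are assumptions, and the query vector is assumed to have been produced by the same indexing steps as the document vectors.

```python
import math

def retrieve(weights, assignments, query_vec, radius=1):
    """Return the winning unit, its documents, and documents on nearby units.

    weights:     {(column, row): weight vector} of a trained map
    assignments: {(column, row): [document ids]} recorded when mapping the archive
    """
    winner = min(weights, key=lambda u: math.dist(weights[u], query_vec))
    hits = list(assignments.get(winner, []))
    # Units within `radius` grid steps of the winner, closest weight vector first.
    neighbors = sorted(
        (u for u in weights
         if u != winner
         and max(abs(u[0] - winner[0]), abs(u[1] - winner[1])) <= radius),
        key=lambda u: math.dist(weights[u], query_vec))
    related = [doc for u in neighbors for doc in assignments.get(u, [])]
    return winner, hits, related

# Toy map with three units; the vectors are placeholders.
toy_weights = {(0, 0): [1.0, 0.0], (1, 0): [0.8, 0.2], (5, 9): [0.0, 1.0]}
toy_docs = {(0, 0): ["T042"], (1, 0): ["T018"], (5, 9): ["T065"]}
print(retrieve(toy_weights, toy_docs, [0.9, 0.0]))
```

Returning the neighboring units separately reflects the browsing step in the text: the winner answers the query, and the surrounding units offer related material for interactive exploration.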

One of the benefits of a digital library system is that documents need not
be assigned a single location. While all articles in the presented application are 
currently assigned to exactly one unit, this is not a requirement of the system. 
Rather, we want articles covering more than one single topic to be assigned to 
multiple units. This can be easily achieved by not creating one feature vector 
description per document but rather creating one feature vector per section of 
a document allowing multiple assignments of documents to units. 

Fig. 2. 6 x 10 SOM of Time Magazine articles set 0

² We will use the notation (x/y) to refer to the unit located at column x and row y
starting with (0/0) in the upper left corner.

³ Due to space considerations we can only present a subset of the maps in this paper.
However, the individual maps and the articles are available for interactive exploration
at http://www.ifs.tuwien.ac.at/ifs/research/ir

Please note that the units of a map do not solely represent the number of
documents mapped onto them during the training process. Rather, the weight
vector of a unit serves as a representative for a topic within the repository. Thus,
new documents being added to the library can be mapped onto an existing SOM,
much like new books can be added to a library bookshelf, while out-of-date
documents can be removed from the map. The representation of topics rather
than library items supports the much more fluid nature of digital collections
as opposed to the rather static, conserving conventional libraries. A trained
SOM thus can serve as a repository for a newspaper, with the daily articles
being mapped onto the appropriate sections of the map similar to conventional
newspaper organization. If new topics emerge, this results in a higher mapping
error, because the SOM is not able to find a unit representing the new topics
appropriately. In this case, either a new SOM can be trained to represent these
emerging new topics using the documents that could not be mapped well enough,
or one of the incrementally growing variations of the SOM may be used [5].
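
Mapping a new document onto a trained map, and noticing when it does not fit, might look like the sketch below. The threshold `max_error` and the function name are hypothetical; the text only states that emerging topics show up as a higher mapping error, not how that error is thresholded.

```python
import math

def map_document(weights, doc_vec, max_error):
    """Map a new document onto an existing SOM.

    Returns (unit, error, fits). fits is False when the distance to the best
    unit exceeds max_error, hinting at a topic the map does not yet cover.
    """
    unit = min(weights, key=lambda u: math.dist(weights[u], doc_vec))
    error = math.dist(weights[unit], doc_vec)
    return unit, error, error <= max_error

# Documents flagged with fits == False could be collected to train a new map.
demo_weights = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0]}
print(map_document(demo_weights, [0.9, 0.1], max_error=0.5))
print(map_document(demo_weights, [5.0, 5.0], max_error=0.5))
```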

4 Distributed Document Collections 

The training process of a SOM assumes that all data for training is available
locally. This assumption is not generally true, especially in the arena of digital
libraries, which generally do not exist as static collections of text in one central
location. Rather, we find the text collections distributed over several sites, often
being highly specialized in certain topics, which we want to combine and access
via one central location. Furthermore, especially in the case of periodicals, we
find small collections of documents being added to the library at intervals, with
each edition of a journal or each annual collection possibly being represented by a
small library SOM. What we want to have is a way to integrate those distributed
libraries into a kind of higher-level library without having to transfer all data to
a central location to produce the training data for the SOM and without having
to train the whole map again.

The SOMLib system allows the integration of those distributed collections 
by using the weight vectors of the various SOMs as input to train a higher-level 
SOM [19,22,25]. Since the weight vector structures of the independently trained 
SOMs differ, i.e. contain different content terms as identified during the docu- 
ment parsing process, they are merged to form a weight vector representation 
containing all content terms of the individual collections, which, in our example, 
leads to a new feature space of 3303 content terms. Instead of using the fea- 
ture vectors of the 420 articles as input for the SOM representing the complete 
document collection, we use the weight vectors of the units in the individual 
SOMs representing the topics present in the various collections. This results in 
the integrating 10 x 15 SOM being trained with 359 input vectors of dimension- 
ality 3303, forming a topologically ordered mapping of the topical sections of the 
individual library maps. 
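
The merging of differently structured weight vectors into one common feature space can be sketched as below. This is an assumption-laden illustration: the text states that the merged representation contains all content terms of the individual collections, but not which value is used for terms a given map has never seen, so the zero fill and the function name `merge_feature_spaces` are choices made here.

```python
def merge_feature_spaces(maps):
    """maps: list of (vocabulary, unit_weights) pairs, one per trained SOM.

    vocabulary is a list of terms; unit_weights maps a unit id to a weight
    vector aligned with that vocabulary. Returns the merged vocabulary and
    {(map index, unit id): weight vector in the merged space}.
    """
    merged_vocab = sorted({term for vocab, _ in maps for term in vocab})
    position = {term: j for j, term in enumerate(merged_vocab)}
    merged = {}
    for map_idx, (vocab, unit_weights) in enumerate(maps):
        for unit, w in unit_weights.items():
            vec = [0.0] * len(merged_vocab)  # unseen terms stay 0 (assumption)
            for term, value in zip(vocab, w):
                vec[position[term]] = value
            merged[(map_idx, unit)] = vec
    return merged_vocab, merged

# Two toy maps with partly overlapping vocabularies.
toy_maps = [
    (["congo", "soviet"], {(0, 0): [0.1, 0.9]}),
    (["soviet", "vietnam"], {(0, 0): [0.7, 0.3]}),
]
merged_vocab, merged = merge_feature_spaces(toy_maps)
print(merged_vocab, merged)
```

The merged vectors, one per unit of every individual map, would then serve as the training input for the integrating SOM in place of the original document vectors.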

The integrating SOM given in Figure 3, instead of representing the document 
vectors on its units, lists the units of the 6 individual SOMs, which in turn repre- 
sent the corresponding articles. Again, we find the topology preserving mapping 
capabilities of the SOM as in the previous example which was trained directly 
using the document description vectors. We now find the units describing the 
documents T042, T032, T018 on the Soviet Union, previously located in the 
upper left corner of map 0 (presented in Figure 2) mapped onto units (8/14) 
and (9/14) of the integrating map. Taking a look at the articles mapped onto 
these units we find that they all cover topics related to the Soviet Union. For
unit (9/14) these are articles T229 - Russia: A Senior Citizen, T542 - Russia: 
Better Things for Better Living through Chemistry; T539 - Russia: Something 
for the Soil. Mapped on unit (8/14) we find 8 units from 4 different maps repre- 
senting a total of 9 documents all dealing with the Soviet Union, surrounded by 
further units representing units and thus articles on this topic. Other document 
clusters identified on the individual maps can be found as well, e.g. the cluster on
the Vietnam war in the lower right corner of the map. For a more detailed and 
intuitive representation of the topical clusters found in the map, please refer to 
Section 5 presenting the LabelSOM method. 

Fig. 3. 10 x 15 SOM integrating 6 independently trained SOMs

As the size of a SOM greatly influences the granularity of the archive representation,
each map can be optimized, representing the articles at the desired
level of abstraction. These maps can then be integrated to form the higher-level
library, greatly reducing training times by using a smaller number of weight vectors
as input instead of all the document feature vectors. Still, the quality of
the resulting integrating map is comparable to a map trained directly with the
feature vectors of the documents [19]. However, not only whole SOMs, but also
parts of a SOM down to individual units can be integrated that way. This facilitates
the creation of personal libraries by allowing every user to choose which
sections of a library she wants to integrate, using only those weight vectors of a
SOM library as input to her personal SOM. For example, a user may choose to
create a personal library covering only articles related to the war in Vietnam.
Thus she can choose to train her personal library SOM using only the weight
vectors of those units on the individual maps that represent this topic. This
makes any articles mapped onto the respective maps later on instantly accessible
on her personal library map. Queries presented to the SOM are passed only
to those units of maps represented by the winning unit, retrieving articles from
the appropriate locations instead of processing the query at all maps.



5 LabelSOM: Labeling the Library 

As we have seen, the SOM lends itself to the representation of document archives
by organizing the documents according to their contents. However, the contents
of the various areas on the map are not visible as such. What we want to have is
- similar to conventional libraries - a kind of guide map to the repository, where
the individual sections are labeled with keywords.

Present SOMs for document archive representation are mostly labeled manually,
i.e. the documents on a particular unit are read and, based on the topics
found on the respective unit, a set of keywords is assigned, similar to the way the
results were described in this paper so far. However, manually assigning labels is
highly labour intensive because it requires manual inspection of all data items mapped
onto the units. What is needed is a way to automatically label the units and
clusters of a SOM to make the structures learned by the map explicit, i.e. to
give a justification for a particular mapping.
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Fig. 4. LabelSOM: Labels for the 6 x 10 SOM 



The LabelSOM method [23,26], developed in the course of the SOMLib project,
makes it possible to automatically describe the categories of documents by
extracting the features learned by the SOM and thus assists the user in
understanding the data collection presented by the map. It is built on the
observation that the weight vectors of a trained SOM serve as prototypes of a
set of input signals, i.e. they exhibit the features of the articles mapped onto
a particular unit. Thus we can assume that those features (i.e. content terms)
that are shared by a majority of documents mapped onto a particular unit serve
as a description for the respective unit. The LabelSOM method finds those
features that are highly similar for all input signals mapped onto a particular
unit and thus best serve as a label for it. The selection of features is based
on the quantization error vector, which stores the quantization error for each
feature at a unit, determined as the summed Euclidean distance in that feature
between the weight vector and all input signals mapped onto that particular
unit. Features exhibiting a low quantization error can then be chosen as the
most likely candidates for labeling the respective unit.
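The selection step can be sketched as follows. This is a simplified reading of the description above, not the published LabelSOM code. For a single feature the Euclidean distance reduces to an absolute difference. The `min_weight` threshold is an assumption added here so that terms absent from all documents (zero weight, zero error) are not chosen as labels. The function names and the toy data are hypothetical.

```python
def quantization_error_vector(weight, mapped_inputs):
    """Per-feature summed distance between a unit's weight vector and its inputs."""
    return [
        sum(abs(weight[k] - x[k]) for x in mapped_inputs)
        for k in range(len(weight))
    ]


def label_unit(weight, mapped_inputs, terms, n_labels=3, min_weight=0.1):
    """Return up to n_labels terms with the lowest quantization error."""
    qe = quantization_error_vector(weight, mapped_inputs)
    # Assumed filter: ignore features whose weight is close to zero.
    candidates = [k for k in range(len(terms)) if weight[k] >= min_weight]
    candidates.sort(key=lambda k: qe[k])
    return [terms[k] for k in candidates[:n_labels]]


# Toy unit: three documents described by four content terms.
terms = ["khrushchev", "nikita", "farm", "congo"]
docs = [
    [0.9, 0.8, 0.1, 0.0],
    [0.9, 0.8, 0.7, 0.0],
    [0.9, 0.8, 0.4, 0.0],
]
weight = [0.9, 0.8, 0.4, 0.0]  # prototype of the three documents

labels = label_unit(weight, docs, terms, n_labels=2)
```

Here the first two terms are shared by all three documents and therefore have zero error, while the third term varies between documents and ranks lower.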

Figure 4 gives the labels selected by the LabelSOM method for the map
depicted in Figure 2. Due to space considerations we can only present the labels
for a subset of all units. However, the quality of the other labels is quite similar^.
Looking again at the units of Figure 2 discussed before, we find unit (0/0)
and neighboring units, located in the cluster of articles on the Soviet Union,
labeled with, amongst others, khrushchev and nikita. The rest of the labels give
more detailed information on the documents on the respective units, e.g. boss,
land, committee on unit (1/0). For the second cluster discussed before, we find in
the lower right corner unit (4/9) labeled kolwezi, katanga, tshombe, elisabethville,
naming the locations and key players of the 3 articles on this unit.

^ The labels for all maps are provided at http://www.ifs.tuwien.ac.at/ifs/research/ir

For the integrating map given in Figure 3 representing all documents in the 
collection based on the distributed maps, the labels are of similar quality, as 
depicted in Figure 5. In the lower right corner we find a group of units sharing 
labels like nikita, khrushchev, moscow, russia, clearly characterizing the Soviet 
Union article cluster identified before. Each unit has more detailed labels like 
farm, chemical on unit (9/14) or peking, chinese on unit (7/14) dealing with
Russian-Chinese relationships. This marks the overlap of two clusters of docu- 
ments, namely a Soviet Union cluster and a cluster with articles on China-related 
matters located right next to it. This arrangement of articles is typical for the 
topical mapping of the SOM. We again leave it to the reader to guess the subject 
matters of the documents represented by the remaining units. 



6 Visualizing Metadata 



While the spatial organization of documents on the 2-dimensional map in com- 
bination with the automatically extracted concept labels supports orientation 
in and understanding of an unknown document repository, much information 
on the documents cannot be told from the resulting representation. Information 
like the size of the underlying document, its type, the date it was created, when 
it was accessed for the last time and how often it has been accessed at all, its 
language etc. is not provided. Since this information provides valuable guidance 
in interactive searching and exploration, methods need to be found to convey 
this information to the user in an intuitive way, refraining from the widely used 
method of simply listing this metadata as textual descriptions of the documents. 
Rather, we want to use well-known metaphors for the representation of the
properties of a document. Thus, we are currently developing the libViewer^, a
user interface to a digital library. It is implemented as a Java applet allowing
simple representation of and interaction with document archives via the
World Wide Web.

A set of metaphors is implemented to allow a flexible mapping of metadata 
attributes to graphical representations in order to best suit the requirements of 
the user as well as the resources present in the library. A number of mappings 
can be defined to optimize the representation for the requirements of a digital 
library, ranging from a rather realistic representation of the items in the library to 
a more abstract one designed for special exploration purposes. We have currently 
realized the following metaphors in the libViewer interface:

^ Preliminary prototype of the libViewer is available for exploration at
http://www.ifs.tuwien.ac.at/ifs/research/ir/libViewer

Fig. 5. LabelSOM: Labeled 10 x 15 SOM integrating 6 maps



— Representation Type: Each piece of work in a digital library needs a physical
representation. A set of templates is defined to represent, for example,
hardcover books, paperbacks, binders, manuscripts, boxes for audio, video and
software components, or links to other libraries in order to provide a rather
realistic visualization of library resources.

— Color: Apart from the physical appearance of an object in the library,
color is the most dominant feature and can easily be detected from a long
distance. Thus, color can be used to represent a variety of attributes in a
very distinctive way, such as language, publication series, genre, topical
classification etc.

— Size: The amount of information available in a book or magazine is
intuitively judged from the size of the physical object, e.g. the number of
pages is estimated from the thickness of a book or box. Size can thus be used to
indicate the amount of information available from a specific resource.

— Format: Format conveys, next to the type of a document, a lot of information
on the genre of a document, considering, for example, oversize format books
such as an atlas or art collection books vs. small paperbacks.

— Logo: When browsing a library, one automatically, and almost without
noticing, recognizes the logos of well-known publishers, associating them with
special types of publications. Thus, while making the library representation
look more realistic, a lot of information can be conveyed using a company
logo or the initial letter for publisher representation.

— Text: In spite of the limited space on the binding, a lot of information is
provided by both the text, such as title or authors listing, and the type
of text representation, such as different fonts or font colors and their impact
on the perception of the document.

— New Book: Books and other items that have been added to a collection
only recently can usually be identified from a large distance by their somewhat
shinier color. Thus, glare effects and reflections can be used to highlight
certain entries in a collection.

— Used Books: Contrary to recently added items, books that have been in a
collection for a long time and are consulted frequently usually
show some signs of intensive usage, such as creased, well-thumbed bindings,
torn-off edges etc.

— Dust: Whereas items in a library that are frequently consulted tend to remain
rather 'clean', dust usually settles on books that have not been referenced
for a long time.

— Bookmarks: Similar to conventional books, we can use bookmarks of different
colors to mark books we are currently reading.

— Shelf Position: When taking a look at bookshelves, we find that books that
are used frequently are usually not neatly aligned with all the other
books nearby, but rather tend to stand out. In terms of query processing, this
metaphor may be used to indicate the relevance of a resource with respect
to a specific query.

— Location: Similar to conventional libraries, resources on identical topics
should be located next to each other in a bookshelf.

Based on these metaphors we can define a set of mappings of metadata at- 
tributes to be visualized, allowing the easy understanding of documents, similar 
to the usage of Chernoff faces for multidimensional space representation [7]. 
However, care must be taken in the selection and definition of these
multifunctional elements, so that the encodings can be decoded by every user,
avoiding the creation of graphical puzzles [32].
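Such a mapping of metadata attributes to metaphors can be thought of as a small configuration table. The following sketch is purely illustrative and does not reflect the libViewer applet's actual data model: the attribute names, visual property names, colors and thresholds are all hypothetical. It shows only that each metadata attribute is paired with one visual property and an encoding function, so that a different collection can be served by swapping the table.

```python
# Hypothetical mapping: metadata attribute -> (visual property, encoding function).
MAPPING = {
    "type": (
        "template",
        lambda t: {"report": "binder", "book": "hardcover", "paper": "paper"}.get(t, "box"),
    ),
    "language": (
        "color",
        lambda lang: {"en": "white", "de": "yellow"}.get(lang, "grey"),
    ),
    # Thickness grows with the page count, clamped to an assumed 1..10 scale.
    "pages": ("thickness", lambda n: min(10, max(1, n // 100))),
    # Dust level derived from the time since the last access (assumed thresholds).
    "days_since_access": (
        "dust",
        lambda d: "spider web" if d > 365 else ("light" if d > 90 else "none"),
    ),
}


def render_item(record, mapping):
    """Translate the metadata of one document into visual properties."""
    visual = {}
    for attribute, (prop, encode) in mapping.items():
        if attribute in record:  # attributes missing from the record are skipped
            visual[prop] = encode(record[attribute])
    return visual


dictionary = {"type": "book", "language": "de", "pages": 850, "days_since_access": 400}
visual = render_item(dictionary, MAPPING)
```

Keeping the table small and the encodings coarse is one way to respect the caveat above: every visual property stands for exactly one attribute, so a user does not have to solve a puzzle to read the shelf.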

Figure 6 provides a sample representation of a digital library containing a
number of books, technical reports, papers and multimedia resources as well as
hypertext links. Please note that for this example we chose to use a different
document collection than in the previous sections in order to be able to
demonstrate a variety of capabilities of the libViewer interface. However, the
mapping of attributes is flexible and can be arranged to suit a given document
collection and application domain; more details on this topic are provided at
the end of this section. The various document types can be easily identified,
e.g. the libViewer and somViewer technical reports in green binders, the 4
different Langenscheidt dictionaries as yellow hardcover books, or various
paperback books published by e.g. Springer. They are created by assigning each
resource type a corresponding document type representation. In the given
example, both journal papers and conference papers are mapped onto the paper
representation metaphor. The difference between conference and journal papers is
indicated by their color, with the latter appearing in a darker color than the
white conference papers. Thus, the hierarchy of document types defined in the
Dublin Core metadata can be mapped onto a hierarchy of metaphorical
representations.

Further attributes are mapped in a similar fashion, e.g. having the logo iden- 
tify the publisher of a book if a corresponding logo is available (e.g. Springer, 
Langenscheidt, Vieweg), or having the thickness of the binding represent the 
size of the underlying resource as for the different Langenscheidt Dictionaries. 
Another straightforward mapping is provided by the degree to which dust has
accumulated on the back of the books, ranging from a few dust particles to a
spider web covering half of a book that has not been referenced for a long time,
as is the case for the second book on the lower shelf. On the other hand, the
last book on the lower shelf is clearly identified as being frequently
referenced by its rather distorted, well-thumbed binding.

Please note that, although possible, it is not the goal of this system to
represent a library as realistically as possible in terms of making all books look
as similar as possible to their real-world counterparts. Rather, we want to create
a metaphorical representation which is optimized for exploration and intuitive 
understanding of document collections or search results. These mappings can 
differ for the specific information and exploration needs as well as for different 
information repositories. Thus, the mapping described in this example is just 
one out of many that are possible. For a different collection we might want to 
map the language of the documents to the color in the representation to clearly 
identify foreign language books. Another possibility would be to assign the colors 
of books based on their year of publication, making the various entries in e.g. a 
journal collection or news magazine archive intuitively visible even when they 
are not sorted by date. The alignment of books may be used to indicate the 
relevance of an item in the collection towards a query for the representation of 
search results. 

Combining the libViewer interface with the spatial arrangement of documents
provided by the SOMLib system results in a set of shelves as depicted in Figure
7, providing an intuitive interface to a digital document collection. Following
the promising initial evaluation of metaphor graphics and the libViewer system
with a small group of users, we are currently preparing a larger usability
evaluation on different document collections, including persons both with and
without experience in computer and digital library usage.



7 SOMLib and Related Work 

Document clustering has been identified as one of the key issues in digital library
exploration and has thus been addressed in a number of projects like the BEADS
system [6] using multidimensional scaling or the BiblioMapper [31] using a
hierarchical clustering algorithm. A technique classifying documents in a
hierarchical topic structure is presented in [14]; the application of the
multiple cause mixture model for text categorization using the Reuters document
collection is reported in [29]. The self-organizing map and related models have
been used on a number of occasions for the classification and representation of
document collections. Among the most prominent projects in this arena is the
WEBSOM system [12], representing over 1 million Usenet newsgroup articles in a
single huge SOM. A variation of this approach using hierarchically organized
SOMs is described in [18] using data from the CIA World Factbook.

Fig. 6. libViewer. Visualizing metadata of documents in a digital library

The need for and benefits of integrating distributed collections is especially 
strong in the field of digital libraries and is thus being addressed in a number 
of projects concerning the interoperability of and access to distributed systems 
[3,10,20]. Similar to libraries being interconnected by some organizational net- 
work, the combination of several independently managed, possibly highly spe- 
cialized information repositories is required with the main goal being to define 
an interface via which these systems can be integrated seamlessly. With the 
SOMLib system being based on the vector space representation of documents, 
integration is supported on the level of document representation, allowing the 
user to build personal libraries of his or her interest. Furthermore, it allows the
individual SOMs to stay smaller while still providing a high level of detail, with new
releases of document collections being integrated at a higher level.

The design of user interfaces allowing the user to understand the contents
of a document archive as well as the results of a query plays a key role in
many digital library projects and has produced a number of different approaches
[1,2,4,11]. However, most designs rely on the existence of a descriptive title of
a document to allow the user to understand the contents of the library, or use
manual assignment of keywords to describe the topics of the collection, as in
the WEBSOM project, where units were labeled with the newsgroup that a
majority of articles on a specific node came from. The LabelSOM method now makes
it possible to automatically label the various areas of the library map with keywords
describing the topical sections based on the training results. This provides the
user with a clear overview of the contents of a SOM library map similar to the
maps provided at the entrance to conventional libraries.

The necessity to visualize information and the results of searches in digital
libraries has gained increased interest recently. A set of various visualization
techniques for information retrieval and information representation purposes was
developed at Xerox PARC as part of the Information Visualization Project [27].
Information is depicted in a 3-dimensional space with the focus being on the 
amount of information being visible at one time and an easily understandable 
way of moving through large information spaces. As one of the first examples of 
metaphor graphics for digital library visualization we may consider the Book- 
house project [21], where the concept of a digital library is visualized using the 
representation of a library building with several rooms containing differing sub- 
collections and icons representing a variety of search strategies. At the CNAM 
library, a virtual reality system is being designed for the visualization of the 
antiquarian Sartiaux Collection [9], where the binding of each book is being 
scanned and mapped into a virtual 3-dimensional library to allow the user to 
experience the library as realistically as possible. The Intelligent Digital Library 
[8] integrates a web-based visual environment for improving user-library interac- 
tion. Another graphical, web-based tool for document classification visualization 
is presented in [15]. While these methods address one or the other aspect of doc- 
ument, library and information space visualization, none of these provides the 
wealth of information presented by a physical object in a library, be it a hard- 
cover book, a paperback or a video tape, with all the information that can be 
intuitively told from its very looks. Furthermore, many of the approaches de- 
scribed above require special purpose hardware, limiting their applicability as 
interfaces to digital libraries. The libViewer provides a flexible way of visualizing
information on the documents in a digital library by representing metadata in
an intuitively understandable way using standard Java technology.



8 Conclusions 

We have presented a digital library system based on the core of a neural network,
namely the self-organizing map (SOM). The SOM lends itself by its very
architecture to the representation of document archives. Documents are organized
on a 2-dimensional map according to their topic. This facilitates both retrieval 
of documents as well as intuitive interactive browsing by finding documents on 
similar topics nearby once you are pointed towards a map area by the map- 
ping of the query. Distributed libraries can be integrated, allowing the flexible 
creation of higher level libraries and personal bookshelves by integrating only 
subparts of maps of your personal interest. Applying the LabelSOM method au- 
tomatically assigns keywords to the units of the SOM describing the contents 
of the various map areas. Thus, the labeled SOM can actually be read and
understood as a guide map to the document archive. In the libViewer interface,
well-known graphical metaphors are used to produce an intuitively
understandable representation of the metadata of the documents. This type of
information space visualization allows the intuitive and straightforward
analysis of large collections of documents, providing an ideal setting for
interactive browsing and exploration.

Abstract. The development of a European digital library for grey literature
is described. The aim has been to provide a digital library for
scientists working in the areas of information science and applied mathematics
and also to build a test-bed for research activities. The service has
been implemented as part of NCSTRL (the US Networked Computer Science
Technical Reference Library) and developed, extending the Dienst
system used by NCSTRL, to meet the requirements of the European
scientific community. The additional functionality is described and the
difficulties encountered when trying to extend an existing architecture,
protocol and system are discussed.



1 Introduction 

The aim of the Digital Library Initiative of the European Research Consortium 
for Informatics and Mathematics (ERCIM) is to promote the development of 
digital library technology in Europe. Since 1996, a series of research-oriented ac- 
tivities, mainly sponsored by the DELOS working group, have thus been organ- 
ised, e.g. workshops, conferences, collaborative studies on DL-related research 
issues. However, towards the end of 1997, ERCIM also decided to undertake 
an implementation activity by setting up its own digital library for documenta- 
tion produced by its member institutes: the ERCIM Technical Reference Digital 
Library (ETRDL). The intention was two-fold: 

— to assist ERCIM scientists in making their research results immediately avail- 
able world-wide and provide them with appropriate facilities for accessing 
the technical documentation of others working in the computer science or 
related areas; 

— to provide the ERCIM DL group with a test-bed for experimental activities 
such as the implementation of new functions or services. 

ETRDL has thus received funding from ERCIM towards the setting up of the 
DL service and from DELOS for DL research activities. 

The first prototype of ETRDL was released in 1998 [1],[2]; after a one-year
period of testing and refining, the ETRDL service is now available for the ERCIM
librarians and scientists and for the general public at
http://www.iei.pi.cnr.it/DELOS/ETRDL. We are now working on developing and
testing additional functionality that will be implemented in ETRDL-2.

In this paper, we describe the work involved in implementing this first version 
of ETRDL^, discuss the reasons why we decided to base our digital library on 
an existing service (the US Networked Computer Science Technical Reference 
Library - NCSTRL) and adopt the same protocol (Dienst), explain why the 
services offered by NCSTRL did not entirely satisfy the requirements of a digital 
library for European research institutions, and illustrate the problems that arose 
when trying to implement additional functionality in an existing system. The 
paper is thus organised as follows: Section 2 outlines the services that should be 
provided by a European digital library for scientists and compares them with 
those offered by NCSTRL; Section 3 describes the basic functionality provided 
by Dienst and Section 4 its application and extension in ETRDL; Section 5 
discusses the language needs of a multilingual community and the provisions we 
are making; the final section indicates our plans for future developments. 



2 What services should a DL provide for the ERCIM 
scientific community? 

The first step towards the development of ETRDL was a survey of the require- 
ments of the ERCIM member institutions in order to determine how best to 
satisfy them. Our starting point was the library and an analysis of how the tech- 
nical documentation produced by our institutes is traditionally processed and 
managed. This is because ETRDL is seen as an extension of the traditional in- 
stitutional library services through the creation of an infrastructure connecting 
the separate technical collections of the single institutions and providing them 
with a gateway to related scientific collections. 

A digital library has been defined as “an institution that performs and/or 
supports (at least) the functions of a library in the context of distributed, net- 
worked collections of information objects in digital form” [3]. This is the per- 
spective taken by ETRDL. We intend that our digital library should provide far 
more than a dynamic remote search functionality; it should provide users with 
a complete service. While it is true that the presence of electronic rather than 
paper documents completely revolutionises information search and document 
access possibilities - by eliminating boundaries of space, time, and location - it 
is also true that, in order to exploit such capabilities, a number of tasks must 
be performed. Some of these are very similar to those of the traditional library 
even if their execution is different: the documents must be acquired, described 
bibliographically, catalogued, and retrieved. Others are new: for the librarian,
they include issues such as the organisation and preservation of digital
collections, security, copyright, control of versions, updating; for the users,
the ways in which documents can be accessed and manipulated. Our intention is
that ETRDL should provide a full set of functionality for the performance of
such tasks.

^ The ERCIM Technical Reference Digital Library (ETRDL) is a collaborative activity
in which eight organisations (CNR, CWI, GMD, INRIA, SICS, INESC, SZTAKI,
FORTH) currently participate. It is sponsored jointly by ERCIM, by the DELOS
Working Group (ESPRIT LTR No. 21057), and by the participating institutions.

The ETRDL system has thus been developed in terms of the services to be 
provided for two main user classes: the librarian, and the scientist. The librarian 
is responsible for the management of the information; the scientist is seen as 
both a potential seeker and provider of information. ETRDL aims at providing 
both the librarian and the scientist with an integrated work environment from 
which any of the DL services on the ERCIM collections can be accessed, in the 
preferred language of the user^. 

This environment must take into account three important features of the 
collections of technical documentation produced by the ERCIM member institu- 
tions: i) for historical and cultural reasons, the boundaries between informatics 
and mathematics are not well defined in these collections; ii) the documents are 
written in many different languages; iii) the collections contain different types 
of documents - currently ranging from technical reports to pre-prints of journal 
articles - with different characteristics and different potential life-times. 

These features have very much influenced the design of the services to be 
offered by ETRDL because they imply functionality that: 

— provides users with local language interfaces 

— searches across languages 

— searches by subject 

— selects sub-collections by date, language, type 

— handles withdrawal and update as well as submission capabilities. 

Another important requirement was also identified: we did not want to create 
a DL in isolation but to develop a tool that would encourage the dissemination 
of ideas and communication between researchers around the world working in 
computer science or applied mathematics. The obvious solution was to follow 
the direction of an initiative already underway in the US: NCSTRL - the Net- 
worked Computer Science Technical Reference Library [4], with a very similar 
core motivation: to improve early and detailed communication of research re- 
sults across the community [5]. This initiative has extensive visibility, with a 
large number of European and North American participating institutions. This 
decision implied employing the Dienst system [6], [7], adopted by NCSTRL for 
disseminating, searching and accessing its documents; Dienst is an open system, 
independent of NCSTRL and can be extended to meet the needs of other appli- 
cations. We thus decided to build our collection along the same lines as those 
established by NCSTRL and to adopt the same basic infrastructure. 

However, there was not complete compatibility between the service offered by
NCSTRL and the requirements of the ERCIM digital library. NCSTRL has focussed
on the provision of an efficient search and retrieval functionality for online
documentation that places the emphasis on the rapid dissemination of technical
documentation, whereas our aim with ETRDL is to provide a full set of
integrated library services. Neither is there total homogeneity between the
NCSTRL and ETRDL collections: NCSTRL contains computer science literature; ETRDL
extends its scope to include applied mathematics, an important area of activity
within ERCIM. The ERCIM digital library has thus been implemented as part of the
NCSTRL collection, but with its own distinguishing characteristics and services.

^ The ERCIM scientific community currently consists of 14 national institutions
speaking 13 different major European languages.

3 Dienst as Infrastructure for ETRDL 

A logical consequence of our decision that ETRDL should form part of the larger 
NCSTRL collection was the adoption of the Dienst infrastructure. Dienst is the 
term used by its developers to refer to a conceptual architecture for digital li- 
braries, a protocol for communication in the architecture, and a software system 
implementing the architecture. The model of the Dienst digital document (DD) 
has two components: the bibliographic description and the “body” of the docu- 
ment; each document has a globally unique name (URN). 

Below, we give a brief outline of the main features of the reference
implementation that we adopted^, before discussing in the following section some
of the measures we have taken in order to specialise it to meet the requirements
of ETRDL.

The Dienst distributed digital library services can be logically divided into four
classes:

— A Repository Service that provides the mechanisms for storage of and access 
to the digital documents. 

— An Index Service that provides the mechanisms for the discovery of DDs. 
It stores meta information about documents in the collection. Queries sub- 
mitted to this service return a set that contains the URNs of the DDs that 
match the query. 

— A Meta Service that provides a directory of locations of all other services. 
This service provides the mechanisms to identify the index servers of a dis- 
tributed digital library collection and to manage meta-information about 
these servers. 

— A User Interface Service that provides a human front-end to the other ser- 
vices. 

Each of these services is accessible via a well-defined open protocol - a set of 
service requests - that defines the public interface to the service. The service 
requests are specified by a signature and a semantics. The signature consists 
of a description of the request and the result formats. The service requests are 
implemented as protocol messages whose format is given in terms of i) the name 
of the service that is to handle the message, ii) the service request and iii) one 
or more, optional, arguments. All messages are encoded into URLs. 
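The three-part message format described above (service name, service request, optional arguments) can be illustrated by a small helper that encodes such a message into a URL. The URL layout, the host name and the request names below are hypothetical and chosen for readability; they should not be read as the actual Dienst protocol syntax, which is defined in the protocol specification.

```python
from urllib.parse import quote, urlencode


def encode_request(base_url, service, request, arguments=None):
    """Encode (service, request, arguments) as a URL; the layout is illustrative only."""
    path = "/".join(quote(part, safe="") for part in (service, request))
    url = f"{base_url.rstrip('/')}/{path}"
    if arguments:
        # Optional arguments are appended as a query string.
        url += "?" + urlencode(arguments)
    return url


# Hypothetical server and request names.
search_url = encode_request(
    "http://dl.example.org/Dienst/", "Index", "Search", {"title": "digital library"}
)
list_url = encode_request("http://dl.example.org/Dienst", "Meta", "ListServers")
```

Because every message is an ordinary URL, any HTTP client can act as a client of a service, which is part of what makes the protocol easy to extend.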

^ The activity described here began in 1997 with Dienst version 4.1.9 and all
observations refer to this version; both NCSTRL and Dienst have progressed since then.

The “openness” of this protocol means that it is possible to add new services
to an existing architecture, to add new service requests to existing services, to
specialise service requests and to replace an existing service implementation by
alternative implementations.

A service is implemented by a server. There are no constraints on the way in 
which a server can be implemented. The only obligation is that it implements 
the functionality of the corresponding service, i.e. it accepts the service requests 
allowed by the protocol and processes them correctly with respect to the protocol 
semantics. 

The open architecture also allows the creation of different federated digital
library instances. An instance is the result of aggregating a collection of servers
that communicate via the protocol. The functionality of the digital library
instance is the union of the service requests supported by the aggregated servers.

The version of Dienst on which we based our implementation had the follow- 
ing organisation for its servers (see Figure 1): 

1. Master Meta Server (MMS) and Regional Meta Server (RMS). These servers 
constitute a distributed instance of the Meta Service. In particular: 

— the MMS stores information on all the institutions participating in a 
collection and on the servers that act as regional metaservers. It also 
stores a set of collection views, i.e. perspectives on collections specific 
to a region; each collection view identifies the list of index servers that 
should be used by the Dienst Standard Site servers in that region to 
process a query. 

— the RMS supplies its regional servers with the list of publishing institu- 
tions that form part of a collection, the list of servers implementing the 
index server to be contacted to process the queries, the list of servers that 
implement the repository service to be contacted for document retrieval. 
The metadata collected by the RMS are retrieved through regular calls 
to the MMS. 

2. Dienst Standard Site (DSS). This server instantiates the functionality of the
Repository Service, the User Interface Service and the Index Service for its 
own digital documents. Each DSS refers to just one RMS. For this reason, 
each DSS belongs to a single, specific region. When implementing the func- 
tionality described above, a DSS can only communicate with the servers in 
its own region. The only exception is when processing queries directed to 
authorities outside its own region; in this case, the DSS directs its query to 
its MIS. 

3. Merged Indexes Server (MIS). This server instantiates the functionality of the
Index Service for all the repository servers that are outside a given region. The
meta-information collected is used to process queries directed to authorities 
not belonging to that region. The MIS communicates with the RMS in order 
to obtain the meta-information regarding its own region. 
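
The routing implied by this organisation can be sketched as follows. This is a minimal illustration, not the Dienst implementation: the `Region` structure, the `route_query` function and all authority and server names are hypothetical. It assumes only what the text states, namely that a DSS contacts index servers in its own region and sends queries for authorities outside the region to its MIS.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Hypothetical view of what a DSS learns from its RMS."""
    index_servers: dict = field(default_factory=dict)  # authority -> in-region index server
    mis: str = ""  # Merged Indexes Server covering authorities outside the region


def route_query(region, authorities):
    """Return a plan mapping each server to the authorities it is asked about."""
    plan = {}
    for authority in authorities:
        # In-region authorities go to their own index server, all others to the MIS.
        server = region.index_servers.get(authority, region.mis)
        plan.setdefault(server, []).append(authority)
    return plan


europe = Region(
    index_servers={"ercim.cnr.iei": "dss.pisa.example",
                   "ercim.inria": "dss.inria.example"},
    mis="mis.europe.example",
)
plan = route_query(europe, ["ercim.cnr.iei", "ncstrl.cornell",
                            "ercim.inria", "ncstrl.mit_ai"])
```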

Dienst is the foundation for NCSTRL but its developers state that it can be 
extended and/or specialised to meet the needs of other applications. In the
development of ETRDL we have been able to test this claim.

Our aim has been to implement the additional functionality required by the
ETRDL services while maintaining compatibility with Dienst, in order to keep 
our status as a partner in NCSTRL. Thus, exploiting the Dienst characteristic 
of “openness”, we have: 

1. specialised the existing Dienst services; 

2. extended the Dienst protocol, including a number of new service requests 
and specialising some of the existing ones; 

3. extended the Dienst architecture, adding a new service. 

4 Implementing ETRDL 

Lagoze [8] defines a DL as “a managed collection of digital objects (content)
and services (functionality) associated with the storage, discovery, retrieval, and
preservation of these objects”. In this section, we discuss how the ETRDL
collection and its services have been designed and implemented in the ERCIM digital
library. It should be remembered that we had two objectives when constructing 
ETRDL: on the one hand we wanted to satisfy the requirements of the ERCIM 
users; on the other we wanted to ensure our non-isolation from the rest of our 
scientific community. 

4.1 The ETRDL Collection 

A DL collection has been logically defined as “a set of criteria for selecting
resources from the broader information space” [8]. Following this definition, we
have characterised the ETRDL collection by the following criterion: the set of
documents whose publishing institutions have names that begin with the string
“ercim”. A necessary assumption is that all the institutions participating
in ETRDL have named their own collections according to this rule.
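
The membership criterion is simple enough to state in a few lines of Python. The publisher names below are invented for illustration, and ignoring case is an assumption of this sketch rather than something stated in the text.

```python
def in_etrdl_collection(publisher):
    """True if the publishing institution's name begins with the string 'ercim'."""
    return publisher.lower().startswith("ercim")


# Hypothetical publishing institution names.
publishers = ["ercim.cnr.iei", "ncstrl.cornell", "ercim.inria", "ncstrl.mit_ai"]
etrdl = [p for p in publishers if in_etrdl_collection(p)]
```

The sketch also makes the stated assumption visible: an ERCIM institution that registered under a name not beginning with “ercim” would silently fall outside the collection.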

In accordance with our requirement of non-isolation, this collection must 
be visible both to ETRDL and to NCSTRL users. Furthermore, ETRDL users 
must be able to access the specific ETRDL services. The intuitive way to achieve 
this appeared to be to implement the ETRDL collection as a specialised sub- 
collection of NCSTRL. However, Dienst only recognises as sub-collections sets 
of documents belonging to a single publishing institution. The problem was 
that the ERCIM users had decided not to concentrate their collections into a 
single publishing institution for several reasons: i) each member institute wanted 
to manage its own documents, ii) a distributed management facilitates local 
specialisation, e.g. of the interface, submission policies, administration services, 
etc. It was thus necessary to find a way to add a collection to NCSTRL that was 
distributed over several publishers. 

The solution we have adopted satisfies all but one of these conditions: the 
ETRDL collection is not visible as such to NCSTRL users who access from non- 
ERCIM sites. These users can access documents of the ETRDL collection, using 
the NCSTRL discovery tools, but they see them as belonging just to the local 
ERCIM publishing institutions, not as part of the larger ETRDL collection. On 
the other hand, the ETRDL interface provides the ERCIM user with a choice 
between three collections: the NCSTRL collection, the ERCIM collection and 
the collection of the local ERCIM institution (see Figure 2). 

Users selecting the NCSTRL collection will access the standard NCSTRL
search and browse functionality; those selecting the ETRDL collection will
access the specialised ETRDL services, while users accessing their local collection
may have extra services available (e.g. an interface in the local language,
additional discovery mechanisms, specialised administrative features).

In order to be able to constitute the ETRDL collection (and overcome the 
constraints imposed by Dienst), we had to choose between two alternative im- 
plementation strategies: 

1. Introduction of a new specialised ERCIM Meta Service (EMS) in the archi- 
tecture through the implementation of dedicated ETRDL meta-servers in 
each region. This service would provide mechanisms to identify the ETRDL 
index servers from the other index servers and to manage meta-information 
about these servers. Each EMS would retrieve information from the RMS of 
its region and simulate the behaviour of the RMS with respect to the ETRDL 
DSSs. The EMS must be transparent to the non-ERCIM DSSs, which
continue to refer to the RMS. This approach implies implementing a number
of extra servers, potentially one for each region. If, for example, an ERCIM 
partner registers in a region that does not contain other ETRDL servers, a 
new EMS must be introduced in that region. For this reason, this approach 
was judged impractical as it would considerably increase the work load. 







A. Andreoni et al. 







Fig. 2. The ETRDL home page 



2. All the ETRDL servers must belong to the same region and communicate 
with the same RMS. The mechanism to identify the ETRDL index servers 
would thus be automatically guaranteed by the architecture. This solution 
requires that the ETRDL servers currently distributed throughout Europe
migrate to a single region and refer to a single RMS. This
solution would appear to be in conflict with the principle underlying the im- 
plementation of the regional meta servers, created to guarantee connectivity. 
However, as there is generally an acceptable level of connectivity in Europe 
and as this solution is practically cost free, we have decided to adopt this 
strategy. 



The adopted solution, which is based on the current static state of the regions,
could prove unsatisfactory if the architecture develops in future towards
a dynamic management of regional membership as a function of
connectivity. We are thus investigating the feasibility of creating a mechanism for
auto-identification of ERCIM index servers. Unlike the above approaches, which
impose constraints on the system architecture, this solution would only involve
modifying the ETRDL DSS.

In the next section, we describe how we have implemented the set of spe- 
cialised ETRDL services. 



4.2 The ETRDL Services 

There are three main classes of ETRDL services: 

1. search and retrieval 

2. submission/withdrawal of documents

3. DL administration 

As we have already stated, the aim of ETRDL is to provide its users with a digital 
library service satisfying their requirements. We have thus made the following 
extensions with respect to NCSTRL. The ETRDL search and browse service 
offers additional functionality by implementing subject searching and browsing, 
and providing users with local language interfaces. The submit/withdraw service
is new, and aims at assisting the authors by providing facilities to classify their
documentation (using classification schemes for both computer science and
mathematics) quickly, easily and correctly. The administration service is also
new, and assists the librarians by providing mechanisms to manage the digital 
documentation efficiently. 



Search and Retrieval. Search operations rely on matching between user queries
and document descriptions; the richer the descriptions, the more successful the
search. The standard metadata supported by NCSTRL (author, title, abstract)
was not sufficient for the implementation of the more powerful search and browse 
functionality requested by the ERCIM users. The first step was thus to define the 
set of metadata to be associated with the documents in the ETRDL collection. 
We have selected a metadata format which represents an extension of the basic 
NCSTRL metadata set, in order to ensure NCSTRL to ETRDL interoperability, 
and which is also compatible with the Dublin Core metadescription standard 
[9]. The decision to comply with the Dublin Core standard was made so that 
integrated queries over different digital libraries using the DC metadata would 
be possible in the future. 

The new fields introduced into the ETRDL bibliographic record are: subject;
type, date and language; and local language abstract.

The inclusion of subject fields makes it possible to overcome a serious defect 
in most current retrieval systems. In recent years, the diffusion of powerful search 
engines capable of indexing full document text has led to the idea that the task
of subject classification, which requires much intellectual work by librarians, is
perhaps obsolete and can be replaced by free keyword searching on indexes of
document terms. The disadvantages of this type of searching, however (e.g. the
false coordination of terms, the risk of very low recall values), are well known.
Therefore we decided that document contents should also be explicitly
represented, by the authors, using descriptors from specialised controlled
vocabularies: the ACM (for computer science) or the AMS (for mathematics) classification
schemes. We also allow our authors to apply free terms, each consisting of one or 
more keywords, for those topics for which classification codes have not yet been 
introduced. It has been found that the frequent use of certain free keywords acts 
as a stimulus for their introduction into a controlled vocabulary. 

The document type, date and language fields have been added to allow the 
selection of particular sets of results, i.e. to refine the results of a search opera- 
tion. The user can restrict the display of results to given document types (e.g. 
technical reports, proceedings, pre-prints, theses, etc.), to documents published 
in a given year, or to documents in a given language (one of the languages of the 
ERCIM member institutions). This is done by selecting the desired type, date 
and language values from a set of pop-up menus. 

An English abstract is mandatory for all documents. Documents in languages 
other than English will also have an abstract in the local language. The local 
language abstract field is used for these documents. 

The Browse Service. The Dienst protocol provides two ways to browse the col- 
lection: 

— by author (all authors, a range of authors, authors whose names begin with
a given letter)

— by year (all years, a range of years, a given year)

In the traditional library, the user can browse through the subject catalogues in
order to become acquainted with the material they contain and to see what is available
on a given topic. We wanted to provide a similar facility in ETRDL to give
the users a starting point to investigate the contents of the collection and thus
improve the precision of their queries. We thus added a new function: browse
by subject terms (all terms, a range of terms, or terms beginning with a given
character).
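
The three browsing modes (all terms, a range of terms, terms beginning with a given character) can be sketched as a single function over the indexed subject terms. The function name, its parameters and the sample terms are hypothetical, and case-insensitive ordering is an assumption of this sketch.

```python
def browse_terms(terms, start=None, end=None, prefix=None):
    """Return sorted subject terms: all, an inclusive range, or by initial character."""
    result = sorted(set(terms), key=str.lower)  # duplicates collapse to one entry
    if prefix is not None:
        result = [t for t in result if t.lower().startswith(prefix.lower())]
    if start is not None:
        result = [t for t in result if t.lower() >= start.lower()]
    if end is not None:
        result = [t for t in result if t.lower() <= end.lower()]
    return result


# Hypothetical subject terms as they might be extracted from bibliographic records.
terms = ["information retrieval", "digital libraries", "metadata",
         "Indexing", "metadata"]
```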

The browse function is implemented in the User Interface (UI) Service of the
Dienst protocol. Introducing the new function has required modifying the
Index Server, in order to create and index the subject terms, and extending the
UI Service, to provide the functions of browse by ACM and AMS classification
codes and browse by keywords.

The Search Service. The Dienst protocol provides two functions for querying 
a collection: Simple Search and Fielded Search. The Simple Search returns all 
documents whose author, title, or abstract contain any of the terms entered. The 
Fielded Search takes as input a complex condition. Logically, it can be seen as 
decomposed in two parts: the set of publishing institutions on which the search 
is to be performed and the condition to be imposed. This is defined in terms of 
the author, title and abstract search fields, and by a boolean operator “AND” 
or “OR” linking the simple conditions specified on each field.

In designing the ETRDL search service, we have attempted to create a
homogeneous work environment for all users. Thus a user accessing the ERCIM
digital library can choose to search the entire NCSTRL collection or only the
ETRDL sub-collection. Both provide certain services and present them to the
user in a certain way. If the service is to be efficient, then it is important that
the user enters a familiar work environment, independently of the collection
selected. For this reason, we have maintained the division introduced by
NCSTRL between Simple Search and Fielded Search, even though we have
modified them.

However, the search functions provided by NCSTRL were insufficient with 
respect to the requirements of the ERCIM users. In particular, for the Fielded
Search we wanted:

1. to have a richer set of access points in order to raise the level of recall and 
precision; 

2. to be able to build sets of results in order to support an incremental formu- 
lation of the query; 

3. to be able to express more complex search conditions. 

The first of these requirements derives mainly from the need to be able to search 
by subject - generally the primary need of a user of a library and, in our opinion, 
also of the digital library user. The second reflects the need to allow users to refine 
the results of a query, in order to achieve a higher precision. The third arises from 
the fact that with Dienst queries the relationship between the conditions imposed 
in different fields must always be the same, either “AND” or “OR”, never a 
combination of the two. For example, the query: (Author = “Sebastiani” AND
(Title = “maintenance” OR Abstract = “maintenance”)) is not acceptable.
This limitation is made more restrictive by the fact that it is not possible to
maintain a result set for a first query, on which to formulate a second query.
Thus a query of this type cannot be executed even in two steps. Furthermore, it
is also impossible to apply field selectors (e.g. on the language of the documents
to be retrieved, or on their date, etc.) using the Dienst search mechanism.

Figure 3 shows the User Interface for the ETRDL Fielded Search. From
the figure it can be seen that Fielded Search has three logical components: the
bibliographic fields (Author, Title, Abstract, Subject) and two radio buttons to
specify whether the values entered in the fields should be “ANDed” or “ORed”;
three selectors to filter documents according to Type, Date, Language; a menu
to select one or more collections on which to perform the search, and a check
box to select all collections.

In order to meet the requirements specified above, we have modified both 
the metadata associated with the ETRDL document with respect to NCSTRL 
and the Dienst protocol. 

In doing this it was essential to maintain interoperability with NCSTRL. The 
service requests sent from servers external to ETRDL had to be accepted and 
the semantics of the Dienst protocol had to be respected. 

Note: By “=” we mean a match of the input against indexed strings in the bibliographic
field that contain the keyword entered as a substring, in any position.

Fig. 3. The Fielded Search



To achieve this goal, we have modified Dienst respecting a principle that we
have called specialisation. Given a service request $SR(p_1, \ldots, p_n)$, where $SR$ is
the name of the service request and $p_1, \ldots, p_n$ are its parameters, that returns a
result $R$ and has a behaviour $B$, we define as the specialisation of $SR$ a service
request that satisfies the following conditions: i) it has name $SR$; ii) it has parameters
$p_1, \ldots, p_n, p_{n+1}, \ldots, p_k$, where $p_{n+1}, \ldots, p_k$ are optional; iii) the type of the result
returned is a subtype of $R$; iv) the new behaviour has the same effect as $B$ when
the service request is invoked with the parameters $p_1, \ldots, p_n$.

For example, the Dienst service request that we have modified most exten- 
sively while respecting the above principle is the SearchBoolean. This is a service 
request of the Index Service. In its original version, it takes as input parameters 
a boolean operator and a list containing one or more of the author, title and 
abstract fields, each of which may have an associated value. A list of records is
returned; each record contains the title, author, date, and the URN of a document
satisfying the conditions expressed by the service request. We have specialised the
service request by extending the list of input parameters with additional fields:
subject, type, year and language. These last three can be employed as selectors
on the result set deriving from a search imposed using the other parameters.
The behaviour of the specialised service request is the following: 

— if none of the additional fields has been specified, it is equal to the behaviour 
of the original SearchBoolean 

— if the subject field has been specified but no selector is indicated, it is similar
to the behaviour of the original SearchBoolean with the difference that the 
condition imposed as value of the subject field is also evaluated 

— if at least one selector has been specified, it first evaluates conditions imposed 
on the author, title, abstract, subject fields; a temporary transparent result set 
is constructed to which the conditions specified by the selectors are applied. 
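
The behaviour listed above can be sketched in Python as a function whose additional parameters are optional, so that a call using only the original parameters behaves as before. This is an illustrative sketch under stated assumptions, not the ETRDL code: the function signature, the dictionary-based records and the sample data are hypothetical, whole records are returned rather than the title, author, date and URN subset, and matching ignores case.

```python
def contains(value, keyword):
    # "=" in the text means a substring match in any position;
    # ignoring case is an extra assumption of this sketch.
    return keyword.lower() in value.lower()


def search_boolean(records, operator, author=None, title=None, abstract=None,
                   subject=None, doc_type=None, year=None, language=None):
    """Sketch of the specialised SearchBoolean over a list of dict records."""
    given = {"author": author, "title": title,
             "abstract": abstract, "subject": subject}
    conditions = {f: kw for f, kw in given.items() if kw is not None}
    combine = all if operator == "AND" else any
    # Step 1: a single boolean operator links the conditions on the search fields.
    hits = [r for r in records
            if combine(contains(r.get(f, ""), kw) for f, kw in conditions.items())]
    # Step 2: the selectors are applied to the temporary result set.
    selectors = {"type": doc_type, "year": year, "language": language}
    active = {k: v for k, v in selectors.items() if v is not None}
    return [r for r in hits if all(r.get(k) == v for k, v in active.items())]


# Hypothetical bibliographic records.
records = [
    {"id": "r1", "author": "Rossi", "title": "Software maintenance",
     "abstract": "A case study.", "subject": "D.2.7",
     "type": "technical report", "year": 1998, "language": "en"},
    {"id": "r2", "author": "Rossi", "title": "Indexing",
     "abstract": "Maintenance of indexes.", "subject": "H.3.1",
     "type": "thesis", "year": 1997, "language": "it"},
    {"id": "r3", "author": "Dupont", "title": "Maintenance costs",
     "abstract": "Cost models.", "subject": "D.2.7",
     "type": "technical report", "year": 1998, "language": "fr"},
]


def ids(result):
    return [r["id"] for r in result]
```

Because one operator applies to every field condition, a query such as Author AND (Title OR Abstract) still cannot be written as a single call.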

It must be observed that the function implemented in this way does not satisfy 
our initial requirements as it still does not allow us to specify certain
significant boolean structures, such as queries in which the relationship between the
conditions imposed in the author, title, subject and abstract search fields is a
combination of the “AND” and “OR” boolean operators. This problem cannot be
overcome with the simple specialisation of the Dienst reference implementation.
In the new version of ETRDL now under implementation we are attempting to 
resolve this problem by extending the protocol. 

The implementation of the ETRDL search service has also necessitated changes 
to the Index Service in order to support the indexing of the new fields and their 
management. 



Submission and Withdrawal of Documents. In the Dienst reference
implementation used by ETRDL, in order to submit a new document to the collection,
the user must:

— fill in a Web-based bibliographic record for the document 

— submit the document in a file via FTP 

This means that the user is obliged to use separate applications (the Web browser
and the FTP client) for a single submission operation, and the administrator
has to maintain an FTP site for new document files and must manage document
insertion in the collection from a shell command line.

In ETRDL we wanted to offer our information providers a single work
environment for the compilation of the bibliographic record and the submission
of the document in the chosen format (PS, PDF, TXT, HTML, TIFF). The
main features of this work environment are simple, controlled access and
guided compilation of the bibliographic record.

We have thus introduced a new service (Administrator and User Interaction
- AUI service) that supplies mechanisms for the submission and withdrawal of documents
and for the DL administration. This service has been implemented by a set of
Web-based modules that interact with the services provided by the architecture
through a set of calls to service requests of the Dienst protocol.

The submit procedure guides the user when inserting the metadata and file
associated with the document. The system performs an automatic check on
the formal correctness of the contents of the obligatory fields. While compiling
the submit form, the user can directly access the Computing Classification
System (CCS) and Mathematics Subject Classification (MSC) codes/descriptors
(via a hypertext link to the ACM and AMS sites). This service operates in two
stages: the author submits the compiled bibliographic record and the document
file to the DL; the administrator is responsible for its approval, and the actual
insertion of the document and its associated metadata in the collection.
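
The two-stage submission can be sketched as follows. This is only an illustration of the workflow described above: the set of obligatory fields, the field names and the in-memory queues are assumptions, and the rule about the local language abstract is taken from the description of the metadata given earlier.

```python
# Hypothetical set of obligatory fields; the text does not enumerate them.
MANDATORY_FIELDS = ("author", "title", "abstract", "language")


def check_submission(record):
    """Return a list of formal problems found in a submitted bibliographic record."""
    problems = [f"missing {name}" for name in MANDATORY_FIELDS
                if not str(record.get(name, "")).strip()]
    # Documents not in English also need an abstract in the local language.
    if record.get("language") not in (None, "", "en") \
            and not record.get("local_abstract", "").strip():
        problems.append("missing local_abstract")
    return problems


def submit(record, pending):
    """Stage 1: queue the author's record for the administrator if it passes the check."""
    problems = check_submission(record)
    if not problems:
        pending.append(record)
    return problems


def approve(pending, collection):
    """Stage 2: the administrator inserts the oldest pending record into the collection."""
    collection.append(pending.pop(0))
```

Keeping the formal check separate from the approval step mirrors the division of responsibility between author and administrator.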

The withdraw procedure guides users when deleting their documents from
the ETRDL. This service has been included to support the inclusion in ETRDL
of temporary documents, e.g. pre-prints that must be removed when the article in
its definitive form is published elsewhere. As with submit, this service operates
in two stages: the author communicates the document to be withdrawn and the
reason to the DL administrator; the administrator is responsible for the
actual withdrawal/deletion. The author/administrator communication takes place
through the automatic generation of e-mail messages.

Additional services provided by the submit and withdraw procedures are:
contextual online help to explain the syntax and semantics of each field; an
e-mail access point to contact the librarian or system administrator directly if
extra help is needed; a bilingual user interface in English and the local language.

Access to the submit and withdraw procedures is controlled and only au- 
thorised users can insert new documents or ask for the withdrawal of existing 
ones. 



DL Administration. As mentioned in Section 4.1, the ETRDL collection
consists of a set of independent subcollections, each of which consists of the
documents produced by a single publishing institution. Each publishing institution is
responsible for the administration of its own collection. The Dienst reference im- 
plementation used by ETRDL provides basic utilities to help the System Admin- 
istrator to manage the collections, e.g. when inserting or removing a document 
from the local repository. These utilities require that the operator is directly 
connected to the server system and that he/she knows the Unix command lan- 
guage. 

However, in most of the ERCIM institutions the librarians will be responsible 
for verifying the formal correctness of the documents and bibliographic records 
submitted and will assign the identification number. As the librarian is not 
usually a systems expert, the ETRDL administration service had to satisfy the 
following requirements: 

— the interface had to be platform independent, so that the administrator could access
the system via a Web browser

— the administrator had to be able to check that the documents and bibliographic
records submitted were formally correct

— the administrator had to be able to communicate with the information provider via
e-mail, if necessary.

We have thus introduced a new administration procedure, part of the AUI
Service, that supplies mechanisms to administer the DL by means of a Web-based
integrated work environment. As with the submit and withdraw procedures,
access to the administration procedure is controlled and limited to authorised 
users. This environment is very similar to that designed for the information 
providers and seekers. 

The administration procedure allows the librarian to insert or delete
documents from collections to which he/she has access. Whenever an information
provider submits a new document or requests the withdrawal of an existing
document, the procedure notifies the librarian through the automatic generation of
an e-mail.

In this work environment, the administrator can i) insert new documents in
the collections he/she is responsible for; ii) eliminate outdated documents; iii)
restrict access to the bibliographic information alone when the document has been
published elsewhere; in this case, a reference to the publication is given.

4.3 The ETRDL User Interface 

The extensions we have made to the basic Dienst service described above have 
clearly affected the interface design decisions. The ETRDL collection is a
specialised sub-collection of NCSTRL, and the user interfaces must reflect this. In this
section, we mention briefly the main issues that have been considered when de- 
veloping the ETRDL user interfaces and outline the difficulties we encountered 
(for a more detailed discussion, see [10]). 

A first major decision regarded the system Home Page(s), i.e. the initial 
access points. ETRDL is an integrated work environment, and thus the Home 
Page must provide access to various types of functionality. We have also had to 
allow for different “views” on our collection: public vs. restricted; centralised vs. 
local. The contents of the ETRDL collection can also be accessed by NCSTRL 
users, but such users can only access the search and browse functionality of 
the separate participating institutions; the ETRDL information provider and 
administrator services are hidden from them. However, for the ERCIM user,
ETRDL is a distributed collection, consisting of the set of the local ERCIM
collections. The local collections are maintained on the local servers of each
partner institution. This has entailed the implementation of two levels of
Home Pages. A centralised access point has been provided to the system through 
the DELOS Web site, whereas a local home page is installed on each local server. 
The “views” provided by these two different Home Pages respect the needs of 
the potential users at each site (centralised and local) and thus provide different 
points of entry. 

The Centralised Home Page is in English only and has been designed for 
IT information users in general, not necessarily from ERCIM (see Figure 4). 
For this reason, it provides links to pages that describe the objectives of the 
ETRDL, to on-line documentation, and to other relevant Web sites. It allows 
the user to access the ETRDL through one of the local servers. Clicking on the 
logo of a given institution will open the relevant local home page interface. Our
initial intention was to provide direct access to the ETRDL collection (with the
extended set of functionality) from the centralised Home Page. However, it was 
decided that this was not realistic; it implied maintaining a centralised server 
as well as the local ones. The user is thus informed that in order to search the 
ERCIM DL collection, he should select one of the local servers. At the same time, 
he is given a choice of language as each local server will maintain interfaces in 
English and in the local language (see Section 5 below). 




Fig. 4. The Centralised Home Page 



The Local Home Page interface caters simultaneously for two user classes,
information users and information providers, by offering two options: search/browse
any collection; submit/withdraw a document to/from a local collection. From the
local home pages, the search and browse functions can be activated over the
entire NCSTRL collection, over the ERCIM collection, or over the collection(s) of
the local institution (see Figure 2). In each case, the user is not only accessing a
different collection (or sub-collection), but is provided with a different
perspective on the information, depending on the functions that have been implemented
at that particular level. When searching on the ETRDL or the local collections,
the user can switch between user interfaces in English or his/her own language.
Online help is available in both languages.

The Administrator Home Page is hidden from the general public and
accessible by authorised persons only. The main functions to be provided by the
administrator interfaces were decided and defined in agreement with all the part- 
ner institutions and are described above. However, no common administrator 
interfaces have been designed; each local institution implements them according 
to local requirements. 

We encountered a number of problems when implementing our interfaces, 
mainly due to the fact that the UI Service of the Dienst protocol was not designed 
to support a multilingual interface. This meant specialising all the UI
Service modules. The aim was to simplify the operations of localisation by the
ERCIM partners. The UI Service has thus been parameterised by
language (see Section 5.1 below). This specialisation has been complex
as the reference implementation of the Dienst architecture conflates the
functions of the user interface service with query routing.



5 Implementing a Multilingual Interface 

One of the points that most distinguishes the ETRDL service from that of 
NCSTRL is the need to handle multiple languages. As already mentioned, the 
ERCIM scientific community currently consists of 14 national institutions speak- 
ing 13 different major European languages, and multilinguality is thus an issue 
of great relevance. While it is true that a considerable proportion of the technical 
documentation in the institutions is produced directly in English, provision had
to be made to enable documents in languages other than English to be included
in the collections, and to enable users who do not have a high competence in
English to use the system in their own language.

As far as ERCIM is concerned, this is not only a practical but also a strategic 
issue, going beyond the restricted domain of computer science. The diversity of 
the world’s languages and cultures gives rise to an enormous wealth of knowledge 
and ideas. It is thus essential that we study and develop computational method- 
ologies and tools that help us to preserve and exploit this heritage. The ETRDL 
collection constitutes a very convenient test-bed on which to study technologies for
multilingual information access.

Two basic issues are involved: 

1. Multiple language recognition, manipulation and display. 

2. Multilingual or cross-language search and retrieval.

The first activities of ETRDL in this area have been aimed at (i) implementing
user interfaces capable of handling multiple languages and (ii) providing very 
basic functionality for cross-language querying. 

5.1 Multilingual Access 

It was decided that each national site should be responsible for localisation, i.e. 
implementation of local site user interfaces in the national language as well as 
the CUI in English. At the very simplest level, this means translating the com- 
mon system interfaces (including the on-line helps) into the local language. For 
the system home pages, at each local site we maintain a version in English and 
in the local language; the user can choose which to activate using the language 
link on the local home page. All the other interfaces of the system are gener- 
ated automatically during run-time. The system code thus includes a language 
variable, which determines whether the procedures should invoke interfaces and 
system messages in English or in the local language, depending on the initial 
choice made by the user. Of course, localisation also implies providing the meta- 
data field descriptors in the local language as well as in English. The group has 
thus been involved in the activity of the Multilingual Dublin Core group [11] and
the descriptors employed in each local language will conform with the decisions 
taken by this group. 
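
The role of the language variable described above can be illustrated with a small sketch. It is not taken from the Dienst or ETRDL code: the catalogue contents and the names MESSAGES and message are assumptions made for illustration only.

```python
# Hypothetical message catalogue; keys and texts are illustrative only.
MESSAGES = {
    "en": {"search": "Search", "no_results": "No documents found"},
    "it": {"search": "Cerca", "no_results": "Nessun documento trovato"},
}
DEFAULT_LANGUAGE = "en"  # assumption: English is the common fallback


def message(key, language):
    """Return the interface text for `key` in `language`, falling back to English."""
    catalogue = MESSAGES.get(language, MESSAGES[DEFAULT_LANGUAGE])
    return catalogue.get(key, MESSAGES[DEFAULT_LANGUAGE][key])
```

With such a lookup, every run-time generated page can be produced from the same procedure, and only the catalogue has to be translated at each site.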

More complex at both the interface and the system level is the question of 
being able to handle and visualise multiple character code sets. Each document 
submitted to the collection is tagged for language. Mechanisms will be provided 
for the local display and printing of non-Latin-1 languages (this has been
implemented at ICS-FORTH but is not yet operational on-line). In the future, we
must decide whether to move to Unicode. 

5.2 Querying in Languages other than English 

At the level of the local collections, users must be given the opportunity to 
formulate queries in the local language and restrict their search to documents 
in that language. We are thus implementing mechanisms for the indexing of 
documents in languages other than English. This implies managing non-English 
sets of characters and stop word lists. Another question to be tackled is that
of handling accented characters; the Latin-1 character code set caters for most
European languages and is thus able to encode and represent all the accented
characters of these languages. However, European users are not always able to
input all of these characters on their keyboards. This is not a problem for local
language querying but can become a problem when querying over the entire
ETRDL collection for authors with “foreign” names, e.g. a user querying for
documents with Author = Müller might enter Müller, Muller or Mueller. We will
have to study and implement robust search and indexing mechanisms for the
author field to handle such cases.
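
One possible way to make the author field robust to such variants is to index and query on a normalised key. The sketch below is only an assumption about how this could be done, not the mechanism adopted by ETRDL; the function name author_key and the digraph table are hypothetical.

```python
import unicodedata

# Assumed convention: umlauts are often typed as a vowel followed by "e".
DIGRAPHS = {"ae": "a", "oe": "o", "ue": "u"}


def author_key(name):
    """Reduce an author name to an accent-insensitive search key."""
    # Decompose accented letters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", name.casefold())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse the vowel + "e" transcriptions onto the plain vowel.
    for digraph, vowel in DIGRAPHS.items():
        stripped = stripped.replace(digraph, vowel)
    return stripped
```

Under these assumptions Müller, Muller and Mueller all map to the same key. The price is some over-matching (for instance, Bauer and Baur also share a key), so such a key is better suited to widening recall than to exact identification of authors.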

Cross-language Querying. A simple form of cross-language querying is already 
possible using the controlled vocabulary (ACM/AMS) terms. All documents in
the ETRDL, in whatever language, classified using this scheme, can be searched.
As authors are also requested to include an abstract in English, English free 
term searching over documents in any language is also possible. INESC has 
developed an LDAP service with a multilingual repository for the ACM and 
AMS classification systems (currently implemented in English and Portuguese), 
which they intend to integrate in their version of the ETRDL system [12]. We 
are now investigating other strategies for cross-language querying. 



6 Next Steps 

In this paper, we have described the first implementation of the ERCIM digital 
library developed as part of the NCSTRL network, employing and adapting the 
Dienst infrastructure. Our reference implementation of Dienst provides a sim- 
ple, monolingual free-text search service. We have extended and specialised this 
service by adding controlled vocabulary search facilities, multilingual interfaces, 
mechanisms for the guided online submission and withdrawal of documentation, 
and for the administration service. Our aim has been to go a step further than 
NCSTRL, offering our users a complete set of digital library services integrated 
in a homogenous work environment, or “work centre” according to the concept 
introduced in [13]. The difficulties we have encountered have been mainly caused 
by our desire to provide this specialised DL service within a much more exten- 
sive network, offering lesser functionality. The need to guarantee compatibility 
with NCSTRL has meant that it has not been possible to satisfy all our initial 
requirements and certain compromises have had to be accepted, above all the 
fact that the ETRDL collection cannot be viewed as such by NCSTRL users. 

As stated in the paper, our reference implementation has been Dienst, version 
4.1.9. However, a new (and final) version of Dienst has now been developed 
at Cornell. This version provides functionality to order the results (including 
ranking). NCSTRL has adopted this new version. If we want to maintain the 
same level of compatibility with NCSTRL, we must produce a new version of 
ETRDL which incorporates the new functionality. 

In any case, our intention in the future is to continue in the direction of 
a dedicated “work centre”, implementing a series of more sophisticated user- 
oriented services. These include tools for semi-automatic document classification, 
procedures for free-text non-English and cross language querying, mechanisms 
for indexing and querying mathematical formulae, systems for document filtering 
and user profile modelling, procedures for the automatic classification of existing 
collections, gateways to other online digital libraries and catalogues for related 
areas of interest. 

Some of these procedures are already in an advanced state of development,
e.g. the semi-automatic classification tool and the user profiling; others are
still at the “wish-list” stage. The existing ETRDL collection functions as a
test-bed for the study and development of these advanced services.


Abstract. This paper briefly describes the organizational and technical
issues and approaches involved in creating an operational digital
library at the University of Crete, found at http://dlib.libh.uoc.gr.
We describe our approaches and experiences over the last few years in
putting into operation a digital library with many collections.
We had to analyze the library goals and user needs, to select appropriate
software, to make a flexible design for the additional functionality needed,
to adapt and extend the selected software to meet current demands,
to install and configure the software, to improve it
using feedback, to interact with document authors and librarians
to make the digital library friendly, usable and easily maintainable, and
even to collect and digitize the library material. The final system is
operated by current library personnel.

The main technical issues are related to the design, implementation and
application of features of digital libraries, such as multilingual storage
and interface, generalization of the software to permit searching on
heterogeneous collections, support for the Z39.50 protocol, tools
that simplify the configuration, administration and data insertion into the
digital library, and tools to input or modify the metadata and to
upload data when submitting new documents to the digital library.



1 Introduction 

Although the spread of digital libraries is increasing [12,7,8], most of the
related issues have only isolated working solutions. Finding appropriate
approaches for many issues and putting them together in a real operational
environment is a hard job.

Digital libraries hold material online, in digital form, and provide
advanced ways of searching, retrieving, accessing and presenting it. By
the term material we refer to both data and metadata (or “data about
data”). Digital libraries support distributed collections of digital data all
over the world. Users of the libraries can retrieve on their computers the data
they are interested in, and study digital copies of them without ever having
to visit the library building itself. A digital library can contain text,
picture, sound, video and other objects, or mixtures of them.

We first give an overview of the operation of a digital library, using DIENST
as a specific example. DIENST [4,5] is the digital library system used in the
Networked Computer Science Technical Report Library (NCSTRL). It was developed
at Cornell University and is used as a search tool for distributed digital libraries.

Every object that is registered under a DIENST server is available in some
digital formats and has metadata (like title, creator and abstract) associated
with it. A user can browse through all objects registered on a server, in which
case a list of descriptions with links to the documents is presented. DIENST
also uses the metadata in user queries, to locate the relevant objects and
provide the user with descriptions and links to them. If the user selects the link
to an object, all available information (metadata) about the object is presented,
together with a list of all digital formats in which the object
is available. The user can select specific formats and retrieve them locally, to
access the object.

The formats that DIENST provides consist of either a single file or multiple
files, which may describe text, pictures, sound, etc. Each file may represent
either the full object, or parts of it (e.g. pages of a document). DIENST can also 
easily be augmented to support additional formats, using simple configuration 
descriptions of the formats. 

When a request is sent to a DIENST server, the server examines the request, 
and if the reply refers to objects stored on other servers, too, the corresponding 
parts of the requests are forwarded to these servers (using the DIENST protocol) 
and all their answers are collected. Then, the merged answer is sent back to the 
user that issued the query. 
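
The forwarding and merging behaviour described above can be sketched as follows. This is a simplified illustration, not the DIENST protocol itself: servers are modelled as plain Python callables, and the names search_local and handle_request are hypothetical.

```python
def search_local(documents, term):
    """Return the ids of local documents whose title contains the term."""
    return [doc_id for doc_id, title in documents.items()
            if term.lower() in title.lower()]


def handle_request(term, local_documents, remote_servers):
    """Answer locally, forward the query to each remote server, and merge the replies."""
    merged = [("local", doc_id) for doc_id in search_local(local_documents, term)]
    for server_name, forward in remote_servers.items():
        # `forward` stands in for a protocol call to another server.
        merged.extend((server_name, doc_id) for doc_id in forward(term))
    return merged
```

In this sketch the user only ever talks to one server; the merged list records which server holds each matching object, so that links to remote documents can be built.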

DIENST uses only one language (English) for the storage of data and metadata
and for the presentation of forms. It is also oriented towards a specific type of
collection, with the metadata fields hardwired into the software. It uses its own
server communication protocol, which is open, but it cannot interact with other
applications. Finally, it includes no tools for easily adding new documents to the
digital library.

In the next section we present the most important organizational issues
that we faced, and in the following sections we examine the main technical
issues that we faced and for which we contributed interesting solutions:
multilinguality, interoperability among heterogeneous collections,
interoperability with other systems, and the creator and administrator interface.
Our approaches are general enough to be applied to most similar situations in
which we envision that digital libraries will be used in the near future.

2 Organization 

As part of a library infrastructure project, we wanted to build a digital library at
the University of Crete. This library would give remote access, using WWW, to
university material, such as master and doctoral theses, departmental collections,
such as an archive of old maps, and other university collections, such as audio
archives with talks on special events. As this was a big project, many people were
involved part-time, and work was shared with other projects. The coordination
and interconnection of these people and activities was quite a complex job.

The organization of a project of that scale is in fact harder than solving
specific technical problems. We will present the design choices we faced and our
final solutions. Most of these problems can be approached in many
ways, and many research groups may be working on each of them. We do not try
to supersede their work, but simply to provide an adequate working solution
that performs well in our case. Where possible, we followed established and
proposed standards. Our digital library can thus be used as a working example
for studying the applicability of specific solutions to each of these problems.



2.1 Requirements 

In order to build a digital library for the university material, many problems
had to be solved. We first had to decide on the functionality of the whole system,
the software to be used, and even the schedule of our activities and the initial
content of the digital library. The University has a lot of useful material that
it would like to make available in the digital library, but even the order of adding
it to the digital library may be of strategic importance: some people do not easily
accept the idea of exposing their personal work that widely. To help accumulate
the material, the university community should see and appreciate the advantages of
creating and using the digital library. Thus, we had to demonstrate a working
environment quite fast.

The University departments are in many different locations, even cities, so
the distributed aspect of the material of the digital library is essential. Also, we
wanted to use software with open standards, so that we can convert our digital
material, data and metadata, to other formats at any time, for use with the more
modern software that will surely appear in the next few years, and so that we are
not tied to the products of a specific software vendor.

The main purpose of this project is to make current and future university
material available in digital form. Thus, we wanted to avoid writing new
software that would be used only once, in this project, and would need constant
effort to maintain. Instead, we wanted to use ready-to-run, already-developed
code as much as possible, or at least to share code with projects in progress,
improving and extending it as needed. Any new code needed for this project should
be designed to be reusable and extendable for many similar requirements and future
projects, ours and those of others, or for public use.

Soon after the start of the project, we had to show results, set and satisfy
milestones and go through evaluation processes. Thus, we had to make a quick
start, and use temporary working replacements for procedures and
software that would be developed later on. This complicated our initial work, and
produced a lot of otherwise unnecessary load, but was unavoidable in order to
satisfy the demands of the project specification. We will only describe the final
approaches here, in the order in which we had to settle on them, as they give a
better long-run picture of the digital library operation.

2.2 Workplan 

We divided our work into three stages. The first stage advertises the digital
library and demonstrates it to the university community; its main objectives are
to build most of the user interface and functionality, to collect initial material,
to start investigating the user needs and the existing material, and to draw up the
initial project schedule. The second stage establishes procedures and finalizes
the software, by adding all useful functionality, such as tools for object creators
and collection administrators, and interoperability among heterogeneous collections
and with other software. For collections to which documents are submitted directly
by their creators, specially designed tools are needed. The third stage completes
the content of the digital library, by digitizing material that is on paper and not
yet in digital form. At the end of these stages, the digital library should be able
to operate with minimal overhead.

We decided to first collect material that would be in high demand and is easy
to start collecting: master and doctoral theses. We also selected the DIENST
software, which has a simple and friendly WWW user interface, adequate for
most user needs.

One of our first goals was to ensure the completeness of the material of the 
resulting digital library, as much as possible. It is the completeness of a library 
that makes the library precious. 

We had to establish procedures to ensure that the already written material
is collected and that no new theses are lost. We started an effort to
collect the completed theses in digital form by contacting the graduated
authors, to avoid making lower quality paper-scanned versions of them.

We had to study the format of the currently available documents, and decide 
on reasonable and convenient requirements that we should pose on future doc- 
uments. We proposed to the University the requirements and procedures that 
were needed for getting digital copies of all new theses, and they were adopted. 

We organized the digital material into many separate collections. Each of 
these collections contains logically correlated material and can be addressed 
or excluded on user queries, and can also be stored in different computers or 
locations, or managed independently, by different collection administrators. 

Our investigation concluded that there is much more exportable university
material that is a candidate for the next collections to be offered: selected diploma
theses, technical reports, working papers, archives and descriptions collected
by the department of History and Archaeology, material from the university
museum of natural history, video tapes and photographs from various university
activities, archive of Cretan literature, archive of old maps, videotaped courses,
academic programs of the departments over the years, bibliography of the courses
offered over the years, university announcements, incoming and outgoing public
paperwork, various journals and magazines published by the university or its
alumni unions, electronic profile of university personnel, etc. A more detailed
exploration, with more personal discussions, will bring to light even more
material that is either needed in digital form or can easily be made available in
digital form.

We had to figure out the most common document formats, make the software
aware of them and build tools to convert other formats into them. More
formats were added to the DIENST known format specification. Among them,
a new HTML format was developed, which also includes an internal set of files
to which the hyperlinks can refer, and which can incorporate a variety of web
format combinations.

We also had to install servers, to gather data and metadata, to cross check 
the master and doctoral metadata with official university records, to provide 
user’s and reference guides, and to train personnel to use, configure and maintain 
the software in its permanent operation phase. During this work, the issues
examined in more detail in the following sections became apparent:



2.3 Technical Challenges 

DIENST uses the English language for the NCSTRL system; in the Computer
Science field, the English language is well established. To handle large
collections of documents described in many languages and to increase the
applicability and usage of digital libraries in non-English-speaking countries, a
multilingual design is needed [9]; this is much harder than a single-language
interface.

The library offers many collections, and each one of them has its own metadata
fields, so that it can be searched intuitively. A master theses collection may
have an author field, while a map collection may have dimensions fields. Both of
them have a title, and we should be able to search all collections together when
we want to search by title and other common fields, but we should also be able
to specify dimensions when searching only map collections. As collections may
be added to the digital library at any time, we should not need to change or
even reconfigure the remote servers; instead, we should have mechanisms to
discover the metadata fields of any collection. This is collection interoperability [1].

We want to provide access to a DIENST digital library via the Z39.50 proto- 
col, a well-established search and retrieval protocol. We map each DIENST col- 
lection to a Z39.50 database and use one Z39.50 server for each DIENST server. 
We directly support different metadata fields per collection, the metadata
fields may contain multilingual information, and we provide hyperlinks to the
digital data [13]. Any Z39.50 client can be used to access the digital library.

We built tools with a WWW interface that can be used by thesis authors to
directly upload the digital documents and submit the associated metadata, and
by collection administrators to input or modify the metadata as well as inspect,
commit or modify submissions [3].

Among other smaller contributions, we extended DIENST so that individual
documents can also be stored on remote locations, and the digital library can
provide and use a remote link to the document.

3 Multilinguality

The design and operation of existing multilingual interfaces on the WWW are
constrained by current limitations, and no generic interface approach has been
implemented that offers a general solution under the existing restrictions of
protocols and software. Most current limitations on the WWW client are related to
the limited number of character sets that HTML pages can refer to.

We had to design the multilingual data storage and handling, and the
multilingual client interface. Our multilingual extensions to DIENST
support queries and browsing of the collections of a digital library in many
different languages, and have been designed to be easily extendable to accommodate
new languages. Our multilingual design is reusable: most of its components do not
depend on DIENST, and may be appropriate in other similar designs.

All digital information, the data and especially the metadata, should normally
be available in all languages of interest, so that it can be used both
for searching and locating the appropriate information, according to the user
specification, and for presenting it to the user. If translations are not available
in some languages, the display of the document is affected, but the retrieval
functionality should not be restricted.

For example, if a search is based on a creator name written in English, docu- 
ments that have no English translation of their creator names will probably not 
be included in the results. On the other hand, a listing of documents presented 
to the user in the English language should not omit documents from the list just 
because there is no available translation of their title in English. 

Each document in the digital library can be stored in different formats,
corresponding to different levels of detail (e.g. resolution) or standards, and can
either be expressed in a human language, e.g. text and speech, or be independent of
any language, e.g. pictures and music. In the first case, multiple representations of
the document, one for each language, are desired. When users ask to see the
document, they will usually select one of its available representations, according to
their preferences. Thus, these different representations are handled just like the
different formats of a document that can appear even in single-language designs.

The contents of a document are not the only basis for specifying search
criteria for it. Additional information that describes it, the metadata, is also
used. To indicate the close relation between a document, all of its available
formats, and its associated metadata, we call them together a (searching) object.

The metadata fields usually contain text, and must be available in each lan- 
guage of interest, to provide full searching and display functionality for the docu- 
ment. Even on documents that are independent of any language, their metadata 
are not, and all multilingual problems apply to the metadata, too. 

In a WWW interface, the usual WWW forms are written in HTML, and are
limited by the HTML directives. The fonts that are specified for the display are
constructed to correspond to a specific character set. Each font and character set
can represent up to 256 characters. The characters are grouped into sets according
to usage, so that most languages (like the western European languages) can
be represented by one character set. If the display must contain more than 256
characters, or characters from more than one character set, the above scheme
is inadequate.

Multilingual user input is also needed. In most cases, user input is expressed
in one language that can be covered by a single character set, containing ASCII
characters and the specific characters of some languages. Using a user-selectable
character set and font, the user is able to provide multilingual input. In more
complex cases, where characters from more than one character set must be
input at the same time, different techniques must be used.



3.1 The Storage Structure 

For multilingual operation, multilingual information has to be stored separately 
not only in the data, but in the metadata and program configuration area as 
well. Depending on the storage structure that will be used, the handling code 
must be extended accordingly. 

As each metadata field is usually stored separately, a major decision is how
to store the new, multilingual, metadata. Of the two approaches that seemed
most appropriate, the introduction of new, separate metadata fields for
the multilingual information, one for each language, is cleaner, as all data and
their translations are clearly distinct. It seems more appropriate for the
design of a new system, but would require many modifications to an existing
scheme.

We chose to use the same metadata fields as in the single-language case,
but to encode the contents of these fields in a way that can easily be split into
the contained translations for the different languages.

Instead of merely using the value of a metadata field, the field contains a 
multilingual string, which includes a substring for each available translation (for 
some or all of the desired languages). We can always use string operations to 
get the substring that corresponds to the desired language. Such multilingual 
strings can be used both in the data and in the permanent program configuration 
information. 

A simple encoding seems to be the most convenient solution for the multilin- 
gual string: A selected character (or a combination of characters) that does not 
appear in the contents of the strings is used as a delimiter to separate the trans- 
lations, one for each language. A predefined order is used for storing the different 
translations, and missing translations are denoted by an empty substring. The 
advantages of our approach are: 

- It is simple to isolate the desired translation (substring operation).

- If the desired translation is missing, a translation to a different language can
be easily used, based on preconfigured priorities.

- The full multilingual string can be propagated to the interface, and the
language separation can be made there. Thus, the user can change the desired
language of display without needing the results of a new query.

- The multilingual string is split into its translations and every translation is
used without any processing, when constructing the indexes to the metadata
field, to produce an all-language common index, so that the user can search
with keywords on these fields without specifying explicitly the language of
the keywords.

- It is easy to add new languages on a running system.
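
The multilingual string encoding and the advantages listed above can be illustrated with a minimal sketch. The delimiter character, the language codes and their storage order, and all function names are assumptions chosen for the example; the actual values used in our DIENST extensions are not implied.

```python
DELIMITER = "|"                 # assumed not to occur inside any translation
LANGUAGES = ["en", "el", "fr"]  # assumed predefined storage order


def encode(translations):
    """Join the translations in the predefined order; missing ones become empty substrings."""
    return DELIMITER.join(translations.get(lang, "") for lang in LANGUAGES)


def decode(multilingual):
    """Split a multilingual string into a {language: translation} mapping."""
    parts = multilingual.split(DELIMITER)
    return {lang: part for lang, part in zip(LANGUAGES, parts) if part}


def select(multilingual, preferred, fallback_order=LANGUAGES):
    """Return the preferred translation, or the first available one by priority."""
    available = decode(multilingual)
    for lang in [preferred, *fallback_order]:
        if lang in available:
            return available[lang]
    return ""


def index_terms(multilingual):
    """Collect the words of every translation for a common all-language index."""
    return {word for part in decode(multilingual).values() for word in part.split()}
```

For example, encode({"en": "Old maps", "el": "Παλαιοί χάρτες"}) yields a string with an empty third substring, and select on that string with preferred language "fr" falls back to the English text. Because decode pairs substrings with languages positionally, a language appended to the end of the list does not invalidate strings stored earlier.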



3.2 The User Interaction 

Users can change language at any time, and the current language
determines the language (and character set) of the user input: every HTML page
of the multilingual interface includes a user-selectable option for the language
used to display the page. When the user selects a language,
this language is used for the display of every subsequent HTML page and for
any user input, until a new language is selected.

The multilingual interface is used for both searching and browsing. The cur- 
rent language is used for providing language specific selection (e.g. choice on the 
range of names) and language specific sorting (e.g. on sorting by name). Missing 
translations on the browsing key are handled by using browsing keys from other 
languages. 

In the interface, we can add a new language with minimal modifications.
Still, concurrent display of many languages, from many character sets, is
not possible under this kind of interface. More complex display solutions, with
either Java or Unicode, must be applied to give a more general solution to the
problem.

Thus, we also developed and provide a Unicode-based interface
to the DIENST digital library software, without the use of Java. The Unicode
interface can be activated at the first DIENST screen, as its use is normally
a matter of available resources, and not a personal choice. When Unicode is
active, a string is transformed to the corresponding Unicode string
before being displayed to the user. With the Unicode interface, the user is able to
access documents and see results with text from many different languages on the
same page.
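
The transformation to Unicode can be pictured as decoding each stored string from the 8-bit character set of its language. The sketch below assumes that translations are stored as bytes in ISO 8859 character sets and that the language of each string is known; the table CHARSETS and the function to_unicode are hypothetical names.

```python
# Assumed mapping from language to the 8-bit character set used for storage.
CHARSETS = {"en": "iso8859_1", "fr": "iso8859_1", "el": "iso8859_7"}


def to_unicode(raw, language):
    """Decode stored bytes into a Unicode string using the language's character set."""
    return raw.decode(CHARSETS[language])
```

The same byte value denotes different characters in different sets (0xE1 is "á" in Latin-1 but "α" in the Greek set), which is why strings from several character sets cannot share one page until they have been mapped to Unicode.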

4 Interoperability of Heterogeneous Collections 

The DIENST software had to change substantially to support heterogeneous
collections with different metadata. The metaserver, a separate program that
informs the servers about the existing collections and the places where they are
available, must now provide more information, such as the metadata fields of each
collection and their translations into all supported languages. The query protocol
had to be augmented to return all matching metadata, in a known order. Furthermore,
for more query flexibility, we augmented the protocol to accept queries with values
that can match any metadata field in the collection. This allows users to pose
simple queries, without addressing a specific metadata field, or to combine this
field with others. DIENST has two ways of searching: simple and fielded search.




Issues in the Development and Operation of a Digital Library



4.1 Searching Anything 

In the simple searching of DIENST, the user specifies a single value that is
matched against the information in any metadata field of an object. In the original
DIENST code, this mode of operation would formulate a query in which all fields are
OR-connected. Now that the collections have different metadata fields, the query
formulation would be collection-dependent and much more complex.

We now use the anything field for this functionality, for both performance
and simplicity in query formulation. Although no such field actually exists in the
metadata, the DIENST server uses this field like any other field, even in queries
involving other fields, and actually matches its value against any metadata field.
The indices of this field are the indices of all other fields together.

Moreover, since simple searching is the most common user search method, used
in more than 60% of the searches (see [2]), this extra index also makes query
evaluation more efficient. Finally, the new field is available together with the
other fields in the fielded search form, increasing query expressiveness.
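
The indexing idea can be illustrated with a minimal sketch. It is not the DIENST code: the function names, the word-level inverted index and the in-memory record format are assumptions made only for illustration. Every word is indexed under its own field and also under the pseudo-field anything, so a simple search becomes a single index lookup instead of an OR over collection-specific fields.

```python
from collections import defaultdict


def build_indices(records):
    """Build one inverted index per metadata field, plus a combined one.

    `records` maps an object id to a dict of field name -> text value.
    The result maps field name -> word -> set of object ids.
    """
    indices = defaultdict(lambda: defaultdict(set))
    for obj_id, fields in records.items():
        for field, text in fields.items():
            for word in text.lower().split():
                indices[field][word].add(obj_id)
                # The pseudo-field indexes every word of every real field.
                indices["anything"][word].add(obj_id)
    return indices


def search(indices, criteria):
    """AND together per-field matches; 'anything' is used like any other field."""
    result = None
    for field, word in criteria.items():
        hits = indices[field].get(word.lower(), set())
        result = hits if result is None else result & hits
    return result or set()


# Hypothetical records with different metadata fields.
RECORDS = {
    "r1": {"title": "Digital Libraries", "author": "Kapidakis"},
    "r2": {"title": "Query Protocols", "abstract": "digital collections"},
}
INDICES = build_indices(RECORDS)
```

With these records, a simple search for "digital" finds both objects, although the word occurs in the title of one and in the abstract of the other, and the anything field can still be combined with a real field in a fielded query.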



4.2 Fielded Search 

In fielded searching, the user specifies values that must match specific
metadata fields. The search mechanism tries to match these values against the
indices held for the corresponding fields.

When searching more than one heterogeneous collection, the user first selects
the collections to search and then gives values for some of the fields
that are common to all selected collections. If the collections share no common
fields, then only the anything field is available.

For example, let us assume we have three collections. The first has the title,
author and abstract metadata fields, the second has title and author and the third
has title and abstract (figure 1). When searching only the first two collections, the
user interface provides entries for values corresponding to the title, author and
anything fields only, the common fields of the collections. It would be meaningless
to search the second collection based on a value for the abstract field. Similarly,
when searching all collections we can specify values only for the title and anything
fields, and when searching the first and third collections, the title, abstract
and anything fields are available.
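
The rule in this example is a set intersection, as the following sketch shows. The collection names and the `searchable_fields` helper are hypothetical; only the field sets come from the example above.

```python
# Field sets of the three example collections (names are placeholders).
COLLECTIONS = {
    "first": {"title", "author", "abstract"},
    "second": {"title", "author"},
    "third": {"title", "abstract"},
}


def searchable_fields(selected, collections=COLLECTIONS):
    """Return the fields offered when searching the selected collections.

    The fields common to all selected collections are offered, and the
    'anything' pseudo-field is always added.
    """
    if not selected:
        return {"anything"}
    common = set.intersection(*(collections[name] for name in selected))
    return common | {"anything"}
```

For instance, `searchable_fields(["first", "second"])` yields the title, author and anything fields, matching the first case described above.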

A new configuration file is used, which describes the local collections, with
their names and descriptions. It also includes translations of the collection names
and metadata field descriptions to all supported languages, and the order of
reporting the fields in the different types of printouts. Finally, default values for
metadata fields of collections not explicitly mentioned can be provided,
simplifying the configuration.

Fig. 1. Cooperation of three DIENST servers with different metadata fields

Fig. 2. New DIENST interface with the two additional buttons

4.3 Fielded Search Interface

In order to support collection interoperability, we extended the fielded search 
interface with two additional buttons, Show fields common to the selected col- 
lections and Show fields common to all collections, as is shown in figure 2. 

At any time, the displayed fields, for which the user can specify search
values, are those common to all collections in the collection browser.
Initially, all registered collections are available in the browser for user selection.

When the user wants to search a subset of the available collections using
fields that are common to that subset, but not to all collections currently in
the browser, he can select the collections and then use the Show fields common
to the selected collections button, as in figure 3, to display all their common
metadata fields. At the same time, the set of collections available in the browser
is narrowed to those that have all the displayed fields (which can be a wider
set than the current selection), so that the user can still change the collection
selection and specify values for any displayed field.

Fig. 3. A restriction on the displayed collections increases the displayed fields

The user can subsequently change the collection selection, and possibly select
Show fields common to the selected collections again to get access to more fields,
or specify values for searching. By selecting Show fields common to all collections,
he can return to the initial state with all registered collections and entries for
the globally common metadata fields. Changing the displayed fields does not change
the selected collections, and all selected collections are always present in
the collection browser.
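
The behaviour of the two buttons can be summarized as a small state machine. The sketch below is only an illustration under assumed names (`CollectionBrowser` and its methods are not part of DIENST), and it leaves out the anything field, which is always available.

```python
class CollectionBrowser:
    """Tracks visible collections, the user's selection and displayed fields."""

    def __init__(self, registered):
        self.registered = dict(registered)  # collection name -> set of fields
        self.visible = set(self.registered)
        self.selected = set()
        self.displayed_fields = self._common(self.visible)

    def _common(self, names):
        field_sets = [self.registered[name] for name in names]
        return set.intersection(*field_sets) if field_sets else set()

    def select(self, names):
        names = set(names)
        if not names <= self.visible:
            raise ValueError("only collections shown in the browser can be selected")
        self.selected = names

    def show_fields_common_to_selected(self):
        if not self.selected:
            return
        self.displayed_fields = self._common(self.selected)
        # Narrow the browser to every collection that has all displayed fields;
        # this set always contains the selection and may be wider than it.
        self.visible = {
            name
            for name, fields in self.registered.items()
            if self.displayed_fields <= fields
        }

    def show_fields_common_to_all(self):
        # Back to the initial state; the selection itself is left untouched.
        self.visible = set(self.registered)
        self.displayed_fields = self._common(self.visible)
```

Because the narrowed set is computed from the displayed fields rather than from the selection, a selected collection can never disappear from the browser, which is the invariant stated above.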

In most cases, users avoid fielded searching. As a consequence, searching
heterogeneous collections, together with its interface, is even less popular.
Whenever possible, it is better to define all collections with the same fields.



4.4 Distributed DIENST Operation 

The DIENST software is a client-server application in which users contact and
query the DIENST server using their web browsers. The results are returned to
the client application and are then displayed in the web browser. Each object
of the digital library should be “registered” with the DIENST server, so that the
server knows about its existence and can access it when necessary.

When two or more DIENST servers cooperate in a distributed system, the
search process dispatches many DIENST copies to query the distributed
digital library servers in parallel. Each server searches its local collection and
returns the matching items to the requesting server, which merges the results and
replies to the user. For distributed operation, a metaserver is needed to provide
information about the configuration of all other servers.
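
The dispatch-and-merge pattern can be sketched as follows. This is an assumption-laden illustration, not the DIENST implementation: the remote servers are replaced by plain Python callables, and a thread pool stands in for the DIENST copies that query the servers in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def distributed_search(servers, query):
    """Query every server in parallel and merge the matching items.

    `servers` maps a server name to a callable that takes the query and
    returns a list of matching item identifiers (a stand-in for a real
    network request to that server).
    """
    with ThreadPoolExecutor(max_workers=max(1, len(servers))) as pool:
        futures = {
            name: pool.submit(search_fn, query)
            for name, search_fn in servers.items()
        }
        merged = []
        for name, future in futures.items():
            # result() blocks until that server has answered.
            merged.extend((name, item) for item in future.result())
    return sorted(merged)
```

The requesting server waits for all answers before replying; a production system would also need timeouts and handling for servers that do not respond, which this sketch omits.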

To enable DIENST to search digital libraries with different metadata
fields, we had to add the names of the metadata fields we operate on as a new
parameter to its search mechanism, and to extend the metaserver protocol, so that
it can also give information about the metadata fields of each server.

A server can consult its collection configuration file, to find the metadata 
fields of the local collections, and the metaserver, to find the metadata fields 
of the remote collections. In distributed operation, the queries formed and the 
merging of the returned results make use of these fields. 

We implemented a DIENST metaserver (the DIENST distribution does not
include one) that, in addition to its normal protocol, provides one more request,
which reports information about the metadata fields of each collection to the
cooperating distributed servers, using the same configuration information as the
DIENST servers.

5 Interoperability with Other Protocols 

DIENST has a built-in mechanism for distributed search. We take advantage 
of the functionality of a DIENST digital library and the features of the Z39.50 
protocol without modifying the code of either the DIENST or the Z39.50 server. 

In summary, the benefits of providing Z39.50 access to a DIENST digital library
are: flexibility in the definition of the metadata fields (the fields can
have a type); flexibility in the retrieved format of a registered object (we may
make information available in many variants and may also provide various
levels of structuring, from flat organization to arbitrary DAGs); and transparent
access to diverse databases, e.g. concurrent access to DIENST and other
Z39.50-compatible library systems.

We have built a Z39.50 server that implements this functionality. The Z39.50
server is an enhanced version of CNIDR Isite; it uses DIENST metadata to build 
a Z39.50 database, accepts a query in order to perform a search in the digital 
library and provides URLs to the digital library objects identified by the query; 
last but not least, the Z39.50 server has the capability to search multiple Z39.50 
servers concurrently and merge the results. 

5.1 DIENST and Z39.50 

The storage organization of a DIENST digital library is hierarchical: the library is
distributed among various servers, each DIENST server provides access to a
number of collections, and each collection consists of registered digital library
objects. We map each DIENST collection to a Z39.50 database and use one Z39.50
server for each DIENST server. Our terminology is based on the ANSI/NISO
Z39.50-1995 document [10], which fully describes the standard. The protocol
interaction is depicted in Fig. 4. Our work was the implementation of a mapping
between a DIENST digital collection and the Z39.50 abstract model.
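
The structural part of this mapping is simple enough to write down directly. In the sketch below the class and function names are hypothetical and the data model is reduced to names and identifiers; it only shows the one-to-one correspondence between DIENST servers and Z39.50 servers, and between collections and databases.

```python
from dataclasses import dataclass, field


@dataclass
class DienstCollection:
    name: str
    object_ids: list = field(default_factory=list)


@dataclass
class DienstServer:
    host: str
    collections: list = field(default_factory=list)


def z3950_layout(library):
    """Map a DIENST library onto Z39.50 servers and databases.

    Returns {Z39.50 server host: {database name: [record ids]}}, with one
    Z39.50 server per DIENST server and one database per collection.
    """
    return {
        server.host: {c.name: list(c.object_ids) for c in server.collections}
        for server in library
    }
```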




Fig. 4. Search and Retrieval using Z39.50 



Typically, a Z39.50 server accesses a database via an API defined by the 
server (not by the Z39.50 standard), which we implemented. A different, generic, 
implementation would translate Z39.50 queries to DIENST requests, having the 
advantage of using the DIENST distributed searching capability. 

However, the fact that DIENST collections have a moderate size and can be
considered read-only (they change very infrequently, and not through the DIENST
protocol) greatly simplifies our implementation and permits concurrent access
to the registered objects without endangering the consistency of the digital
library. As a result, we can bypass the DIENST server and avoid translating
from Z39.50 to the DIENST protocol, and the two servers need not be aware
of each other. This approach is simpler to implement, but it requires launching
an independent Z39.50 server and needs additional effort to provide distributed
searching; this effort is not significant.

One of the design decisions was not to change the Z39.50 server protocol code.
This decision permits the easy incorporation of additional functionality when
new Z39.50-1995 facilities are implemented by Z39.50 servers; e.g. automatic
configuration of the client according to the DIENST metadata associated with
each object (Explain facility).

The Z39.50 server that provides access to a DIENST digital library can be
accessed by a generic Z39.50 client, but there is a trend of Z39.50 clients towards
the web. This trend was the motivation for enhancing a gateway so that it
generates the search form on the fly. The gateway uses a file that contains
global information, as well as a description of the attributes which can be used
to formulate a query, in order to modify an HTML template and generate the
search form.
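
The template-driven generation of the search form might look like the sketch below. The layout of the description (a title, an action URL and a list of attributes with a name and a label) is an assumption, since the format of the gateway's file is not given here, and `render_search_form` is a hypothetical name.

```python
from html import escape
from string import Template

# HTML template with placeholders filled in by the gateway.
FORM_TEMPLATE = Template(
    '<form action="$action" method="get">\n'
    "<h1>$title</h1>\n"
    "$inputs\n"
    '<input type="submit" value="Search">\n'
    "</form>"
)


def render_search_form(description):
    """Generate the search form from global information and attribute list."""
    rows = []
    for attr in description["attributes"]:
        name = escape(attr["name"], quote=True)
        label = escape(attr["label"])
        rows.append(f'<label>{label} <input type="text" name="{name}"></label><br>')
    return FORM_TEMPLATE.substitute(
        action=escape(description["action"], quote=True),
        title=escape(description["title"]),
        inputs="\n".join(rows),
    )
```

Since the form is derived from the attribute description, adding a metadata field to a collection changes the form without any change to the gateway code.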



6 Submissions and Administrator Interface 

The submission tools normally share configuration files with DIENST.
They provide functionality that covers even rare requirements and offer many
optional features through configuration choices. For example, confirmation
messages may be desired for destructive actions, or notification messages may be sent
on specific asynchronous events.

The functions of the tools can be subdivided into three distinct but
complementary categories: metadata manipulation, data (digital format)
handling and repository management. We will analyze each of them separately.



6.1 Metadata Manipulation 

No data are accepted unless their metadata have already been submitted.
Specifically for DIENST, the metadata file follows the format described in RFC 1807 [6],
which proposes a generic format for organizing metadata.

This format is simple to understand. However, it can be quite difficult to
follow consistently for users with little computer experience, who are in fact
the majority of the people expected to submit an object to a library.
Moreover, it can often be frustrating for collection administrators to maintain
a digital library by editing such files by hand. Furthermore, since a digital library
is accessed from the WWW, the submission of metadata is
done in a similar manner, as in figure 5.

Also, the extensions to manage multilingual objects and to handle heterogeneous
collections require different and more complicated metadata. Our
implementation manipulates metadata in a way that conforms to these extensions,
providing at the same time a multilingual user interface. The code that
depends on the specific format in which metadata are stored forms a distinct
module and can be easily modified to manipulate metadata for different systems.

Fig. 5. Metadata submission by user



Moreover, RFC 1807 supports multi-valued attributes in the form of
repeated attribute-value pairs. Thus, if an object has two or more creators,
multiple lines for this field should be created in the metadata file. By default,
such fields can be filled in as comma-separated lists. The user can enter, for
example, Greg Karvounarakis, Sarantos Kapidakis as the value for the
field creator. All these internal transcriptions are transparent to the user who
wants to submit an object to the library, who only has to fill in a form with
the appropriate fields for each collection for the supported languages.
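
A minimal sketch of this transcription is given below. The `FIELD:: value` line syntax is that of RFC 1807; the helper name is hypothetical, and the field name creator follows the example above rather than the field list of RFC 1807 itself.

```python
def to_rfc1807_lines(field, value):
    """Expand a comma-separated form value into repeated attribute lines."""
    names = [part.strip() for part in value.split(",") if part.strip()]
    return [f"{field.upper()}:: {name}" for name in names]


lines = to_rfc1807_lines("creator", "Greg Karvounarakis, Sarantos Kapidakis")
# lines == ["CREATOR:: Greg Karvounarakis", "CREATOR:: Sarantos Kapidakis"]
```

Note that a plain split on commas would break names entered as "Kapidakis, Sarantos", so a real form would need a convention for such values.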

Some metadata fields (e.g. submission date or CS-TR-version) are automati- 
cally filled with default values. Furthermore, some of the fields have extra restric- 
tions: for example, the field id denotes the collection in which the object is going 
to be placed and also its identification as a library object. While the collection 
name should be chosen by the user among a list of the existing collections, it is 
preferable that the id is automatically generated, to ensure uniqueness. Thus, we 
create an id based on the date of the submission, a serial number and a random 
code (serving as a temporary password, for security). Also, mandatory fields 
must be completed for at least one of the languages supported by the system. 
Finally, only the administrator is allowed to modify specific metadata fields, if 
so configured. 
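
One way to build such an id is sketched below. The exact layout (collection name, date, zero-padded serial number, random hexadecimal code) is an assumption for illustration only; the text above specifies the ingredients but not the format.

```python
import secrets
from datetime import date


def make_submission_id(collection, serial, today=None):
    """Combine collection, submission date, serial number and a random code."""
    today = today or date.today()
    # The random part doubles as a temporary password for the submitter.
    code = secrets.token_hex(4)
    return f"{collection}/{today:%Y%m%d}-{serial:04d}-{code}"
```

The date and serial number make the id unique within a collection, while the random code is what makes it hard to guess.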

New submissions of metadata and data are normally placed in a temporary
repository with the same structure as that of the DIENST repository, rather
than in the permanent repository. It is also possible to have a hierarchy of
temporary repositories and collection administrators. Non-administrators can
only access and modify objects in this temporary repository. When a user tries
to modify an object which has been placed in the permanent repository and has
no entry in the temporary repository, a new entry is created to store the modified
information (based on the original entry in the repository), associated with its
permanent repository entry. New or modified submissions are later inspected by
the collection administrator and, if approved, are placed in the repository. The
creator can also attach contact information and comments about the submission
for the collection administrator.



6.2 Manipulation of Digital Formats 



The tools can even handle complex digital formats, such as those that are physically
formed by a hierarchy of files. A degenerate case of such a hierarchy is the set of
scanned pages of a paper document. Apart from the validity of each file separately,
we also check that the files can be put in a logical order, and check for missing
pieces, in order to display them correctly. We handle file configurations that have an
obvious, unambiguous order.
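
For the scanned-pages case, the ordering and completeness check can be sketched as follows. The naming convention (a page number immediately before the file extension, pages numbered from 1) and the function name are assumptions for illustration.

```python
import re

# A file name such as "page12.tif": any prefix, a page number, an extension.
PAGE_NAME = re.compile(r"^(?P<stem>.*?)(?P<number>\d+)\.(?P<ext>\w+)$")


def order_pages(filenames):
    """Return (files in page order, list of missing page numbers)."""
    pages = {}
    for name in filenames:
        match = PAGE_NAME.match(name)
        if not match:
            raise ValueError(f"cannot determine page number of {name!r}")
        number = int(match.group("number"))
        if number in pages:
            raise ValueError(f"duplicate page {number}: {name!r}")
        pages[number] = name
    if not pages:
        return [], []
    missing = [n for n in range(1, max(pages) + 1) if n not in pages]
    ordered = [pages[n] for n in sorted(pages)]
    return ordered, missing
```

Sorting by the numeric value rather than by file name avoids the classic mistake of placing page 10 before page 2.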

The digital library objects may be represented in several digital instances
(formats). For example, a picture may be in different resolutions, and possibly
according to different image standards, or a document may be accompanied
by many translations. Several issues arise regarding the way these formats are
uploaded to the library as well as the way they are organized in the repository.

Fig. 6. Submission of a data file by a user

After submitting the metadata, the user can repeatedly select data files to
submit and their type (or select auto detection of the file type) in a form such
as the one shown in figure 6. To upload the files, we use the proposed standard
RFC 1867 [11]. It is possible not only to send more than one digital format,
but also to replace an uploaded format with a newer version or to add new formats
some time later.

The description of the supported digital formats is also read from the
DIENST configuration files. The submission process handles both metadata and
digital formats, and stores them in a way compatible with DIENST, in a
temporary repository. The collection administrator need only move the files to their
permanent space. This will usually be done using our WWW interface, which offers
a minimum-effort procedure to inspect and discard, or approve and commit to
the repository, each digital format separately or whole submissions at once.



6.3 Repository Management 

The administrator interface is a superset of the creator interface and can be used to
overwrite creator entries, or to create new entries, especially for bulk object
entry. Additionally, there are some more actions that an administrator is able
to perform:

— Browsing, to see and select a submission on the temporary repository as in 
figure 7.

Fig. 7. Selection of a submission by the collection administrator



— Inspecting — approving a submission: He can inspect submitted objects to 
decide whether they should be approved and placed in the digital library.
This action actually consists of two parts.
First, the administrator has to inspect and possibly modify the metadata
describing the object, to ensure completeness and validity, as in figure 8.
This is similar to modifying the metadata
of a submission, and therefore a similar interface is used.




Fig. 8. Cross check of a submission by the collection administrator 



Then, the administrator checks what digital formats have been submitted 
for this object and, if possible, whether they are valid (e.g. not corrupted). 
At the end, the administrator should either approve the submission, and 
commit it to the digital library as a whole or in part or reject it by erasing 
it from the temporary storage space, or leave it in the temporary space to 
be modified or inspected at a later time. 

— New submission: This action is similar to that for a creator. However, the 
administrator is allowed to intervene in the creation of the submission. For
example, an administrator should be able to change the id of a submission.
Therefore, the administrator is allowed to overwrite the automatic id and
other special fields that are created for the new submission. Moreover, after
the metadata have been stored and the digital formats have been uploaded,
the administrator is able to immediately commit the new submission to the
permanent repository. This way, inspection and approval can be done in one
stage.

— Modification of permanent digital library objects: The collection
administrator is able to delete, modify or replace with newer copies permanent
library objects, similarly to the modification of a submission that appears
in the temporary space. For repository consistency, all modified or new
submissions are first stored in temporary space. Then, the administrator can
choose which parts from the older version should be kept, and which should
be replaced by the modified or new ones, as in figure 9.

Fig. 9. Update of the permanent repository by the collection administrator



The introduction of such web-based administrative tools raises a major security
issue, since they can be used to modify or
delete objects from both the temporary and the permanent repository. To cope
with this problem, we applied access control to these tools, using the access
control options offered by the web server. Since the whole application is based
on server scripts, this is a simple way to ensure that only the persons that should
use these tools will be allowed to access them.

7 Conclusions 

Our digital library is easily extendable, and all its code can still be freely used 
by anyone. This is a complete digital library, ready to accept material and to 
operate, in an efficient and friendly way. 

There are still many technical issues that deserve better solutions: we could 
use better communication protocols, improve the distributed performance of the 
system, or use a more general purpose multilingual solution and better interfaces. 
Automatic translations and thesaurus browsing would be a very useful interface 
feature. 

But for a successful digital library, the organizational issues are the ones
that play a key role, and also define the priority of the technical problems. We
had to solve several problems, from software and interface design decisions to
contacting authors for the collection of their material. One must be very close
to the library, to set its policy directions and to see and satisfy its real needs
early enough. We hope that our organizational steps, as well as our technical
contributions and software, will help others to reach their digital library goals
more easily and quickly.

Abstract. Z39.50 is a client/server protocol widely used in digital
libraries and museums for searching and retrieving information spread over
a number of heterogeneous sources. To overcome semantic and schematic
discrepancies among the various data sources, the protocol relies on a
world view of information as a flat list of fields, called Access Points
(AP). One of the major issues in building Z39.50 wrappers is to map
this unstructured list of APs to the underlying source data. Unfortunately,
existing Z39.50 wrappers have been developed from scratch and
do not provide high-level mapping languages with verifiable properties.
In this paper, we propose a Description Logic based toolkit for
the declarative specification of Z39.50 wrappers. We claim that the
conceptualization of AP mappings enables a formal validation of the query
translation quality and therefore ensures the quality of the retrieved data.
Finally, it makes it possible to tackle a number of pending Z39.50 issues (e.g.,
metadata retrieval, query failures due to unsupported APs, etc.) by enriching
the generated Z39.50 wrappers with a number of added-value services
such as conceptual structuring of flat Z39.50 vocabularies and intelligent
Z39.50 query assists.



1 Introduction 

With the advances in digital processing and communication technologies, an
increasing number of organizations and individuals are using the Internet for
publishing, broadcasting, and exchanging information all over the world. The
ability to share, interpret, and manipulate information from multiple sources is
a fundamental requirement for large scale applications, e.g., digital libraries and
museums. A widely used protocol for searching and retrieving information in a
distributed environment is Z39.50 [2]. To achieve interoperability [41], Z39.50
(Version 3) relies on (i) standard messages, formats, and procedures governing
the communication of clients and servers (system interoperability), (ii) a world
view of information as a flat vocabulary of fields, called Access Points, that
abstracts representational details of source data (semantic and schematic
interoperability), and (iii) basic textual search primitives to express Boolean queries
in the form of field-value pairs (functional interoperability).

In order to execute Z39.50 queries, sources should wrap their actual data 
organization, format and access methods according to the Z39.50 specifications 
established for a specific application, community, etc. These specifications are 
described in the various profiles (i.e. metadata) proposed by national or
international bodies (e.g., Library of Congress, CIMI, etc.). It should be stressed that
the quality of the established mappings between the source and the Z39.50 view 
of information is fundamental in order to ensure the quality of the retrieved data 
(i.e. accuracy, consistency, completeness, etc.). Unfortunately, existing Z39.50 
wrappers are developed using some programming language and they do not pro- 
vide abstract mapping languages with verifiable properties [42,11,43]. In this 
paper, we advocate a Description Logic framework [9] (such as proposed in the 
context of the DARPA KSE [39]) for the declarative specification of Z39.50 wrap- 
pers using high-level concept languages. We claim that modeling the required 
mappings as first-class citizens, instead of hard coding them in the wrappers (i) 
allows the formal validation of the translation quality (e.g., ill-defined mappings, 
inappropriate APs); and (ii) opens unexpected opportunities to tackle a number
of pending Z39.50 issues (e.g., metadata retrieval, query failures due to
unmapped APs, multiple answer sets handling, etc.).

Building a wrapper for an information source according to a Z39.50 profile 
(e.g., for digital libraries [32,31], museums [44,20], etc.) implies the translation 
of (i) the Z39.50 Access Points (AP) to the underlying source data structure and 
semantics, (ii) the Z39.50 Boolean filters to the source query primitives, and (iii) 
the returned source data from their original format to a predefined Z39.50 record 
syntax (e.g. GRS-1, XML). For loosely structured sources (e.g., Information
Retrieval Systems) wrapping is relatively simple. It essentially requires defining
some renaming mappings from the APs to the source data attributes, tags, etc.
(e.g., the AP AU to the field author, etc.). However, for highly structured sources
(e.g., Database Management Systems, Knowledge Base Systems) the translation
process is considerably more complex. This is mainly due to the fact that there 
exists a significant mismatch between the Z39.50 flat view of information and 
the underlying source data model (e.g. relation or class based). In this context, 
what is really needed is to define for each AP a view on the source data. 

To address this issue we introduce an intermediate level between the Z39.50 
and the source world, based on advanced knowledge representation and reason- 
ing support, specifically Description Logics (DL). DL provide declarative lan- 
guages to represent and reason about interrelated sets of objects using modeling 
primitives such as concepts, roles, and individuals. Starting from a set of prim- 
itive concepts and roles representing source conceptualization, we capture the 
semantics of the AP mappings as derived concepts formed by primitive ones 
and standard DL concept operators [5]. Since DL can serve both as knowledge
representation languages and as query languages [8,40,14], derived concepts
essentially act as views [15] against which Z39.50 queries are evaluated with source
data. Our contribution is twofold: (i) we propose a toolkit for the declarative
specification of Z39.50 wrappers using standard DL reasoning mechanisms [22];
and (ii) we enrich the generated Z39.50 wrappers with a number of added-value
services such as conceptual structuring of flat Z39.50 vocabularies and
intelligent Z39.50 query processing in order to facilitate metadata retrieval and avoid
embarrassing query failures due to unsupported, i.e., unmapped, APs.

The rest of the paper is organized as follows. In Section 2 we give an
example of a cultural information source and describe the problems encountered
in wrapping it according to a digital museum Z39.50 profile. In Section 3 we briefly
recall the core Description Logic (DL) model and show how it can be
applied for the declarative specification and validation of Z39.50 AP mappings.
Section 4 presents the Z39.50 query processing in our DL framework and
Section 5 elaborates on the offered added-value wrapping services. The architecture
of the Z39.50 wrapper toolkit is presented in Section 6. Finally, we conclude and
discuss future work in Section 7.

2 An Example of a Cultural Information Source 

In this section we describe the contents and structure of a cultural information
source that will be used as a running example in the rest of the paper. We focus
on the mismatch of the information conceptualization in our test database and a 
Z39.50 profile for Digital Museums [44,20], as well as on the consequent problems 
we have encountered in order to develop a Z39.50 wrapper in the context of the 
AQUARELLE and CIMIzit projects [36,35]. 

2.1 The CLIO System 

As a testbed we use the CLIO cultural documentation system, developed at the 
Institute of Computer Science, Foundation for Research and Technology-Hellas 
(ICS-FORTH) in close cooperation with the Benaki Museum, Athens and the 
Historical Museum of Crete, Heraklion. CLIO supports the recording and man- 
agement of an evolving body of knowledge about ensembles of cultural goods 
and addresses the needs of museum curators and researchers. The functional ker- 
nel of CLIO is the Semantic Index System (SIS) developed by ICS-FORTH [21]. 
SIS is a persistent storage system based on the object-oriented semantic network
data model TELOS [37].

Figure 1 illustrates some features of our example data source, inspired by
the CLIO system, namely simple and multiple classification as well as
multi-valued and optional attributes. A museum object is represented as an instance
of the class “MuseumObject”. It may have (optional attributes) an owner (class
“Owner”) and be constructed with the use of one or more (multi-valued
attributes) materials (class “Material”), processes (class “Process”) and techniques
(class “Technique”). Each museum object is associated with a series of events (class
“Event”) characterized by their kind, date and involved actor. For instance, the
saber of Androutsos (a hero of the 1821 Hellenic Revolution) is made of shaped
silver (multiple instantiation) and it was constructed by Filimon in 1815.
Although not illustrated in our example, SIS-TELOS also supports simple and
multiple inheritance, unbounded classification, and treats attributes as first class
citizens classified on their own.

Fig. 1. An Example of a Cultural Information Source



2.2 Z39.50 Wrapping for Digital Museums 

Z39.50 [2] is a session oriented and stateful application protocol, based on the 
client-server architecture. To overcome semantic and schematic discrepancies 
among the various data sources, Z39.50 relies on a common information model 
shared by all clients and servers. It consists of a flat list of fields, called Access 
Points (AP) (or more precisely Use Attributes), on which queries are expressed. 
For instance, in the CIMI [20] and AQUARELLE [44] profiles, the supplied APs
correspond to general information categories like People (specific persons or
cultural groups), Dates of many sorts (including dates of creation, acquisition,
exhibition), Places (e.g. place of creation, places associated with an event, galleries,
provenance), Subject (exact description of depicted material), Style (including
movement and period), Method (including process and techniques), Material,
etc. [29].

This vocabulary of fields is employed by a client in order to search and
identify records from the underlying sources and next, to retrieve some or all of
them. Z39.50 queries are formulated using Boolean connectors (and, or,
and-not), search terms (i.e. Use attribute-value pairs), and qualifiers specifying
lexicographical comparisons (e.g., greater than), truncations (e.g. right, left), etc.
Going back to our cultural scenario, the following query searches for all the
museum objects related to Androutsos that have been created after 1887:

Q1: PersonalName=“Androutsos” and
    (DateOfCreation=1887 Relation=“GreaterThan”)

According to Figure 1, the person Androutsos might be the creator (i.e. the
actor involved in a creation event), or the owner of the object. This implies
that a query on the AP PersonalName should be translated by the wrapper
into queries on the source Actor and Owner classes. Furthermore, a query on
the AP DateOfCreation should be translated into queries on the TimeSpan
class and the associated Object_Event and Kind classes. Finally, the returned
museum object information should be formatted/converted by the wrappers
according to a commonly agreed record syntax (e.g. GRS-1, XML) and structure
(e.g., elements ObjectId, Title, Creator, etc.).

We believe that the underlying Z39.50 information model is better suited to
querying loosely structured text bases than highly structured data sources. Indeed,
due to the significant mismatch between the Z39.50 and the source information 
model, most of the existing structure and semantic richness of the sources is not 
taken into account during querying while wrapping becomes considerably more 
complicated. It becomes clear that an AP may be translated to a source query 
on one or more classes or relations using one or more attribute selections, joins, 
etc. There is a wide range of Z39.50 mapping cases (see below) and nothing 
guarantees that the semantics of the specified views correspond to the intended 
meaning of the APs in the Z39.50 profile: it may be included in the original 
AP meaning, partially overlapped, disjoint, etc. This is typically the kind of 
information that is missing from existing Z39.50 wrappers in order to verify the 
quality of the retrieved data (i.e. accuracy, consistency, completeness, etc.). Two 
Z39.50 wrapping issues are worth further elaboration and they will be addressed 
in the rest of the paper. 

Unsupported Access Points: Since the AP meaning is defined in a profile without 
prior knowledge of the source contents, it may correspond to information only 
implicitly represented in the source or it may not correspond at all to any source 
information. For example, our cultural source documents objects from the gun 
collection kept in the Benaki Museum and although not explicitly stated, this 
information could be used to answer queries on the AP Location. On the other 
hand, the AP Protection Status, dealing with buildings and monuments, is not 
at all applicable. According to the protocol, both APs are considered
unsupported in our source, and queries containing them will fail and return diagnostic
messages. For large scale applications where queries are generated by a Z39.50
client without knowledge of the wrappers’ metadata (i.e. mappings), it is very likely
that at least one unsupported AP exists per source. This will result in embarrassing
query failures, and users risk obtaining no answer from the sources. A commonly
used approach to cope with this problem is to omit the unsupported APs from
the broadcast query and try to answer only the supported part. Obviously,
with this approach users cannot tell whether the returned answers resulted from
the execution of the full query or only part of it, while the various Z39.50
wrappers behave in an unpredictable manner.
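
The omission strategy described above can be sketched as follows. This is only an illustration under stated assumptions: queries are nested tuples, only the and/or connectors are handled, and the function `prune` and the set `supported_aps` are hypothetical names. Returning the list of dropped APs alongside the pruned query is one simple way a wrapper could let the client know that only part of the query was evaluated.

```python
def prune(query, supported):
    """Drop search terms on unsupported APs.

    A query is ("term", ap, value) or (op, left, right) with op in {"and", "or"}.
    Returns (pruned_query_or_None, dropped_aps) so the caller can tell the
    user which part of the query was not evaluated.
    """
    if query[0] == "term":
        ap = query[1]
        return (query, []) if ap in supported else (None, [ap])
    op, left, right = query
    kept_left, dropped_left = prune(left, supported)
    kept_right, dropped_right = prune(right, supported)
    dropped = dropped_left + dropped_right
    if kept_left is None:
        return kept_right, dropped
    if kept_right is None:
        return kept_left, dropped
    return (op, kept_left, kept_right), dropped


supported_aps = {"PersonalName", "DateOfCreation"}  # hypothetical wrapper metadata
query = ("and",
         ("term", "PersonalName", "Androutsos"),
         ("term", "ProtectionStatus", "Preserved"))
pruned, dropped = prune(query, supported_aps)
```

Here `pruned` keeps only the PersonalName term and `dropped` names ProtectionStatus, which is exactly the information that is lost when the unsupported part is silently omitted.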

Fixed collections of Retrieved Objects: The information returned in response to
a client request is always associated with a specific data collection in the source
(e.g. a persistence root). In the rest of the paper we will call this root the central
concept. No matter what the queried APs are, the answer always corresponds to
central concept instances (e.g., museum objects) appropriately converted into a
record structure having a fixed number of fields (also defined in the profile as
Record Elements). This implies that all the queried fields are supposed to be
connected in a source with the Z39.50 central concept. However, this is not the
case with structured sources (relational or object-oriented), where multiple
collections are supported and data relationships are not always explicitly stated in
the schema (using foreign keys or object paths). Furthermore, even when such
paths are explicitly stated, Z39.50 profiles usually support APs for expressing
full-text queries that require navigation over sets of paths. For instance, we may
use the AP Any to query on the term “Androutsos” without specifying exactly what
the related APs are: “Androutsos” may correspond in our cultural source
to a person owning an object, a person creating an object, a geographical location,
etc. Unless the native query language of the source supports generalized
path expressions [18,19], this kind of mapping cannot easily be expressed in
structured sources. It is up to the Z39.50 wrapper administrator to decide how queries
are evaluated under these circumstances, in a more or less ad hoc way.



3 Declarative Specification of Z39.50 Wrappers Using DL 

Description Logics (DL), also known as terminological logics, have been
intensively studied for more than a decade in the field of Knowledge Representation
and Reasoning Systems (KRRS). DL provide declarative languages for the
representation of and reasoning about classes of objects and their relationships,
encompassing other well-known formalisms such as entity-relationship or class
inheritance models [17]. Recently DL have received considerable attention in the
context of information integration systems [3,33,16,25,6], since they have been shown to
provide flexible formalisms to model and reason over a large number of data
integration views [34]. We follow the same approach to declaratively define the
required AP mappings as views over source data. It should be stressed that,
compared to previous work on data integration, our context is quite different:
(i) Z39.50 wrapping involves only one source at a time (vs. mediation of several
sources); (ii) the Z39.50 world view of information is intrinsically flat (vs. middleware
structured models); and (iii) Z39.50 wrappers support some query processing
(vs. simple translations of queries and data). In what follows, we briefly recall
the core DL model that we use to cope with the various Z39.50 wrapping issues
presented in the previous section and to provide Z39.50 wrappers with formally
verifiable mapping specifications.



3.1 The Core Description Logic Model 

The main modeling primitives of Description Logics (DL) are concepts, roles, and 
individuals. A concept describes a class of elements (individuals) in the domain 
of interest and is defined by the conditions that must be satisfied by the elements 
in the class. A role describes a relationship between two individuals. The two 
basic components of a DL system are the terminological box (TBox) and the
assertional box (ABox). The former contains the concepts (intensional knowledge)
and the latter contains the individuals (extensional knowledge). There exist two
types of concepts: Primitive and Derived. The definition of a primitive concept
specifies only the necessary conditions for an individual to be an instance of it.
On the other hand, the definition of a derived concept states the necessary and
sufficient conditions for an individual to be an instance of it. This implies that an
individual has to be explicitly declared as an instance of a primitive concept, while
instances of derived concepts are inferred by the DL system.

The interpretation of a DL knowledge base $\Sigma$ is $\mathcal{I} = (\mathcal{I}(\Delta), \mathcal{I}(\cdot))$, where
$\mathcal{I}(\Delta)$ denotes a non-empty set of values (the domain) and $\mathcal{I}(\cdot)$ an interpretation
function, mapping every concept to a subset of $\mathcal{I}(\Delta)$, every role to a subset of
$\mathcal{I}(\Delta) \times \mathcal{I}(\Delta)$, and every individual to an element of $\mathcal{I}(\Delta)$ such that
$\mathcal{I}(a) \neq \mathcal{I}(b)$ for different individuals a, b (Unique Name Assumption). Intuitively, the
interpretation of a concept C (denoted as $\mathcal{I}(C)$) is the set of objects that are known to
belong to that concept. A concept $C_1$ is said to be subsumed by another concept
$C_2$ (denoted as $C_1 \preceq C_2$) if and only if $\mathcal{I}(C_1) \subseteq \mathcal{I}(C_2)$. Based on this subsumption
relation, a set of concepts can form a taxonomy having a bottom ($\bot$) and a top
($\top$) concept.



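Name                        Representation     Representation        Semantics
                            (concrete)         (abstract)
Concept Name                A                  $A$                   $\mathcal{I}(A) \subseteq \mathcal{I}(\Delta)$
Top                         TOP                $\top$                $\mathcal{I}(\Delta)$
Bottom                      BOTTOM             $\bot$                $\emptyset$
Union                       (OR A C)           $A \sqcup C$          $\{d_1 \mid d_1 \in \mathcal{I}(A) \cup \mathcal{I}(C)\}$
Intersect                   (AND A C)          $A \sqcap C$          $\{d_1 \mid d_1 \in \mathcal{I}(A) \cap \mathcal{I}(C)\}$
Not                         (NOT A)            $\neg A$              $\{d_1 \mid d_1 \notin \mathcal{I}(A)\}$
Existential Quantification  (SOME R C)         $\exists R.C$         $\{d_1 \mid \exists d_2 : (d_1,d_2) \in \mathcal{I}(R) \wedge d_2 \in \mathcal{I}(C)\}$
Universal Quantification    (ALL R C)          $\forall R.C$         $\{d_1 \mid \forall d_2 : (d_1,d_2) \in \mathcal{I}(R) \Rightarrow d_2 \in \mathcal{I}(C)\}$
OneOf                       (ONEOF i, j, ...)  $\{i, j, \ldots\}$    $\{i, j, \ldots\}$
Role name                   R: A, B            $_{A}|R|_{B}$         $\mathcal{I}(R) \subseteq \mathcal{I}(A) \times \mathcal{I}(B)$
Reverse                     (REVERSE R)        $R^{-1}$              $\{(d_1,d_2) \mid (d_2,d_1) \in \mathcal{I}(R)\}$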


Table 1. Concept and Role forming operators 
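
The set semantics of Table 1 can be illustrated with a small evaluator over a finite interpretation. This is a sketch, not the reasoning procedure of any DL system: it computes extensions in one given interpretation rather than reasoning over all models, the function name `interpret` and the tuple encoding are assumptions, and the tiny ABox (including `ev1` and the roles assigned to Filimon and Androutsos) is hypothetical.

```python
def interpret(expr, domain, concepts, roles):
    """Return the set of individuals denoted by a concept expression.

    expr is a concept name (str) or a prefix-style tuple as in Table 1:
    ("OR", C, D), ("AND", C, D), ("NOT", C), ("SOME", R, C), ("ALL", R, C),
    ("ONEOF", i, j, ...). A role R is a role name or ("REVERSE", name).
    """
    def role_pairs(role):
        if isinstance(role, tuple) and role[0] == "REVERSE":
            return {(b, a) for (a, b) in roles[role[1]]}
        return roles[role]

    def ev(sub):
        return interpret(sub, domain, concepts, roles)

    if expr == "TOP":
        return set(domain)
    if expr == "BOTTOM":
        return set()
    if isinstance(expr, str):
        return set(concepts[expr])
    op = expr[0]
    if op == "OR":
        return ev(expr[1]) | ev(expr[2])
    if op == "AND":
        return ev(expr[1]) & ev(expr[2])
    if op == "NOT":
        return set(domain) - ev(expr[1])
    if op == "ONEOF":
        return set(expr[1:])
    if op in ("SOME", "ALL"):
        pairs, filler = role_pairs(expr[1]), ev(expr[2])
        if op == "SOME":
            return {a for (a, b) in pairs if b in filler}
        # ALL: every role successor of d (possibly none) lies in the filler.
        return {d for d in domain
                if all(b in filler for (a, b) in pairs if a == d)}
    raise ValueError(f"unknown operator: {op!r}")


# Hypothetical finite interpretation loosely following the museum scenario.
domain = {"Saber5691", "ev1", "Filimon", "Androutsos"}
concepts = {"MuseumObject": {"Saber5691"},
            "Actor": {"Filimon"},
            "Owner": {"Androutsos"}}
roles = {"hasEvent": {("Saber5691", "ev1")},
         "hasActor": {("ev1", "Filimon")},
         "ownedBy": {("Saber5691", "Androutsos")}}

persons = interpret(("OR", "Actor", "Owner"), domain, concepts, roles)
made_by_actor = interpret(("SOME", "hasEvent", ("SOME", "hasActor", "Actor")),
                          domain, concepts, roles)
```

In this interpretation `persons` contains both Filimon and Androutsos, and `made_by_actor` contains the museum object reached through an event that has an actor.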



The part of the TBox that contains the primitive concepts is called the schema
part, while derived concepts form the view part [15]. The TBox-schema part
consists of a finite set of axioms having one of the forms $A \preceq D$, $R \preceq C \times D$, where
A, C, D are primitive concepts and R is a role (note that roles have restricted
to and from values). The TBox-view part consists of a finite set of concept
definitions having the form $A = E$, where A is a derived concept and E is a concept
expression formed by other concepts and the operators shown in Table 1. Cycles
in concept definitions are not allowed (see [38] for formal definitions). In the
next subsection we will explain these operators through examples illustrating the
mappings of Z39.50 APs to our cultural source. Finally, disjointness of classes in
the TBox is given by axioms of the form $A \parallel C$ (i.e., $\mathcal{I}(A) \cap \mathcal{I}(C) = \emptyset$).

The ABox is defined from a finite set of declarations having one of the forms
$C(a)$ and $R(a,b)$. The first one (unary predicates) declares that individual a
belongs to the interpretation of the primitive concept C, and the second one
(binary predicates) declares that there exists a role R from a to b (belonging
respectively to the interpretations of concepts C and D in the definition of R).
The main reasoning services [22] offered by a DL system $\Sigma$ are the following:

— Concept Satisfiability ($\Sigma \not\models C = \bot$): checking whether a concept has a
non-empty interpretation,

— Subsumption Checking ($\Sigma \models C_1 \preceq C_2$): checking whether a concept $C_2$ subsumes
$C_1$,

— Instance Checking ($\Sigma \models C(a)$): checking whether an individual a belongs to the
interpretation of a concept C.

The above core model corresponds to an almost standard DL framework [5],
currently supported by several DL systems^1, e.g., CICLOP^2, FaCT^3, KRIS^4,
HAM-ALC^5, etc.

3.2 DL Concept Languages for Z39.50 AP Mappings 

In a very natural way, source structure and semantics can be represented as
primitive concepts and roles, while the AP mappings are represented as derived
concepts (i.e. views) defined on top. Figure 2 illustrates the primitive concepts (TBox-schema
part) representing our cultural source schema given in Figure 1, while the derived
concepts (TBox-view part) correspond to the established mappings of the
CIMI-AQUARELLE profile APs [20,44]. The data of our cultural source correspond
to the individuals (ABox) of the DL system. Note that this is only a logical
view of information from the Z39.50 wrappers (see Section 6) and there is no
need to actually load source data into the DL system (virtual ABox). In the
following examples we illustrate the expressive power of the proposed DL concept
language (see Table 1) to capture the various kinds of translations involved in
Z39.50 wrapping for structured sources (see Section 2).

Example 1: Perhaps the simplest case to map an AP is when its semantics 
corresponds exactly to one concept of the source schema. For instance, the AP 
Date is translated as follows : 

$Date = Time\_Span$

Example 2: In most practical cases, APs should be mapped by combining more
than one source concept using the DL Union and Intersect concept forming
operators. For instance, information about persons in our cultural source is
represented by the concepts Actor and Owner, and the AP PersonalName is mapped
as follows:

$PersonalName = Actor \sqcup Owner$

Similarly, the mapping of the AP Method is defined as:

$Method = Process \sqcap Technique$

Furthermore, mappings of abstract APs like Who, describing any personal or
corporate name that can be found in our source, are defined by using other AP
derived concepts, such as:

$Who = PersonalName \sqcup CorporateName$

$Any = Who \sqcup What \sqcup When \sqcup Where$

Finally, APs like Any, for full-text queries, are easily mapped by considering the
definitions of abstract APs like Who, What, When and Where (the 4W APs).

^1 The only subtle issue here is the introduction of restricted and inverse roles as in [26].
^2 http://www-ensais.u-strasbg.fr/LIIA/ciclop/ciclop.htm
^3 http://www.cs.man.ac.uk/~horrocks/FaCT/
^4 http://www.dfki.uni-sb.de/Tacos/kris.html
^5 http://kogs-www.informatik.uni-hamburg.de/~moeller/ham-alc/








[Figure 2: diagram not reproduced. Panels: TBox (Schema part), TBox (View part), ABox. Recoverable labels: Location, Method, Name, Personal Name, Event, Protection Status, Time_Span, MuseumObject, ofKind, Object_Event, Technique, Process, Material, Silver, Shaped Silver, Saber5691, Filimon, Androutsos.]



Fig. 2. Modeling an Information Source and Z39.50 APs mappings in DL 



Example 3: More complicated situations arise when the AP mapping requires a 
traversal over the roles associated with aggregated source concepts. For example, 
to map the AP DateOfCreation we need to define the following derived concept
using the DL Inverse Role operator:

$DateOfCreation = \exists (happenedIn)^{-1}.(\exists ofKind.\{\text{“Creation”}\})$

The above expression has three parts: (i) the bracket expression corresponds to 
a concept having as interpretation only the individual “Creation”, i.e. subsumed 
by Kind, (ii) the parenthesis expression represents the related creation Events, 
and (iii) the whole expression captures the Dates associated with these events. 
Note that the restriction of a role to and from values obviates the need to verify 
that the returned individuals actually belong to the interpretation of Date. 
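
The three parts of this expression can be traced on a small example. The sketch below is illustrative only: the events `ev1` and `ev2`, the kind "Acquisition" and the dates are hypothetical data, and role assertions are modeled simply as sets of pairs.

```python
# Hypothetical ABox fragment: role assertions as (from, to) pairs.
happened_in = {("ev1", "1821"), ("ev2", "1890")}          # event -> time span
of_kind = {("ev1", "Creation"), ("ev2", "Acquisition")}   # event -> kind

# (i) the bracket expression: a concept holding only the individual "Creation"
creation_kind = {"Creation"}

# (ii) the parenthesis expression: events whose kind is "Creation"
creation_events = {event for (event, kind) in of_kind if kind in creation_kind}

# (iii) the whole expression: follow happenedIn backwards, i.e. collect the
# time spans in which one of those creation events happened
date_of_creation = {span for (event, span) in happened_in
                    if event in creation_events}
```

Only the time span of the creation event is returned; the date of the acquisition event is filtered out in step (ii).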

Example 4: For APs corresponding to information not explicitly stated in a
source, the DL OneOf concept forming operator is used to translate them. For
instance, although not given in our example source, it is known that all objects
belong to the Benaki Museum (Athens) gun collection, and hence the APs
CorporateName, Location and Collection are mapped as follows:

$CorporateName = \{\text{“Benaki Museum”}\}$

$Location = \{\text{“Benaki Museum Athens”}\}$

$Collection = \{\text{“Benaki Gun Collection”}\}$

This implies that CorporateName, Location and Collection are concepts whose
interpretation contains only one individual, respectively “Benaki Museum”,
“Benaki Museum Athens” and “Benaki Gun Collection”.

Example 5: In the case where there is no information in the source corresponding
to a specific AP, the related derived concept is defined to be equivalent either
to the Bottom, i.e. the concept with an empty interpretation, or to the Top, i.e. the
concept whose interpretation contains all the individuals. The decision depends
on the expected precision and recall: the former favors precision while the latter
favors recall. More precisely, according to the semantics of APs in a Z39.50 profile,
we consider that the Top for AP mappings is the AP concept Any previously
defined (see Example 2). For instance, the AP ProtectionStatus, which is used
for preserved buildings and cannot be mapped to our cultural source of museum
objects, is translated as follows:

$ProtectionStatus = \bot$ (or $ProtectionStatus = Any$)

In both cases, wrappers are able to smoothly incorporate unsupported APs into
the query processing (see Subsection 5.2) and avoid embarrassing query failures.

3.3 Formal Validation of Z39.50 Wrapping Quality 

Having defined the mappings of the Z39.50 APs as derived concepts on top
of a source schema (i.e. views), standard DL reasoning services like Concept
Satisfiability can be used to infer whether some or all of the AP mappings are
ill-defined. Consider, for instance, that the concept Material of our cultural source
is disjoint with the concepts Technique and Process (see Figure 2). Then, the
following mapping of the AP Method (see Example 2) is inconsistent:

$Method = Material \sqcap Process \sqcap Technique$

Indeed, due to class disjointness the AP derived concept Method describes a
necessarily empty set. In our DL framework we can formally check whether
$\Sigma \models Method = \bot$, i.e. whether Method has a contradictory description (i.e. intensional
semantics). More generally, we can verify the consistency of all the established mappings (i.e., that
they are well defined and not mapped to the bottom) without actually accessing the
source data, by simply checking whether the TBox has at least one model.

This kind of quality service is not supported by existing Z39.50 wrappers.
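
For the special case used in this example, a conjunction of primitive concept names, the check reduces to looking for a declared disjoint pair. The sketch below covers only that case and is not a general DL satisfiability procedure; the name `conjunction_is_satisfiable` and the `DISJOINT` table are assumptions made for illustration.

```python
from itertools import combinations

# Disjointness axioms assumed in the example: Material || Technique, Material || Process.
DISJOINT = {frozenset({"Material", "Technique"}),
            frozenset({"Material", "Process"})}


def conjunction_is_satisfiable(names, disjoint=DISJOINT):
    """True unless two of the conjoined primitive concepts are declared disjoint."""
    return not any(frozenset(pair) in disjoint
                   for pair in combinations(names, 2))


bad_method = ["Material", "Process", "Technique"]   # the inconsistent mapping
good_method = ["Process", "Technique"]              # the mapping of Example 2
```

The check uses only the TBox axioms, so, as stated above, no source data needs to be accessed.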

To conclude this section we should note that modeling the AP mappings as
DL derived concepts allows the development of Z39.50 wrappers with formally verifiable
properties. More precisely, (i) APs whose meaning is not at all or only implicitly
represented in the source can be effectively mapped and smoothly incorporated
into the query processing; and (ii) consistency of the established AP mappings
can be easily checked without accessing the source data. These added value
services are quite useful for profile developers, Z39.50 wrapper administrators
and end-users.

4 Z39.50 Query Processing Using DL 

Since DL can serve both as a knowledge representation language and as a query 
language [8,40,14], Z39.50 queries can also be modeled as derived concepts. More 
precisely, a query can be seen as a description of the necessary and sufficient 
conditions that have to be satisfied by the individuals forming its answer set, i.e. 
its interpretation. Conversely, primitive (i.e., source) or derived concepts (i.e., 
AP mappings) can be used for data querying by considering their interpretation. 
In what follows, we present how the Z39.50 Boolean filters can be (i) translated by
the wrappers using the same DL concept language employed to map the Z39.50
APs, and (ii) rewritten by taking into account the defined AP views and the
fixed central concept of the data actually returned by a source (see Section 2).

4.1 The Core Z39.50 Query Languages

As we have seen in Section 2, Z39.50 queries are essentially composed of search
terms with APs and qualifiers for comparisons, truncations, etc., possibly
combined using Boolean connectors. Consider, for instance, the following simple
query (i.e. no qualifiers):

Q2: PersonalName = “Androutsos”

Recall that PersonalName is an AP, mapped as a derived concept ($C_{AP}$) to the
Actor and Owner concepts, and “Androutsos” a value considered as an individual
(a). Q2 can be translated into a basic query to the DL knowledge base $\Sigma$ using
the Instance Checking reasoning service ($C_{AP}(a)$):

$\Sigma \models PersonalName(\text{“Androutsos”})$

If the individual “Androutsos” is in the interpretation of the concept PersonalName
(i.e. the union of the Actor and Owner interpretations), the knowledge
base returns a positive answer and the answer set (i.e., query interpretation)
contains only the individual “Androutsos”. Otherwise the answer set is empty. More
formally, core Z39.50 queries can be defined as DL derived concepts (TBox-query
part) that will be interpreted with source individuals (ABox) in the following
way:

Definition 1. Given a DL knowledge base $\Sigma$, an Access Point derived concept
$C_{AP}$ and a core Z39.50 query q of the form AP = a, the answer set of q is given
by the interpretation of the concept $C_q$: $\mathcal{I}(C_q) = \{a \in O_\Sigma \mid \Sigma \models C_{AP}(a)\}$,
where $O_\Sigma$ is the set of individuals of $\Sigma$.

Note that query answering relies here on some form of closed world assump- 
tion [27]. In the style of [23] we make the realistic assumption about complete 
knowledge of the DL extensional part (i.e., source data) and thus consider in the 
interpretation of concepts only their known individuals. 

Now let us see how we can express Z39.50 queries using relation or truncation 
qualifiers like, for instance : 

Q3: PersonalName=“Andr” Truncation= “Right” 

These search operators cannot be directly expressed in a standard DL framework,
but they can be captured as external functions. The DL operator TEST-C
allows calling various test functions outside of a DL system. This operator is
essentially an escape method from the limits of the DL expressiveness, allowing
individuals to be manipulated using external functions written in some programming
language (see e.g., CLASSIC^6 [12]). A test function f gets an individual as
argument and returns TRUE if the individual satisfies the conditions specified in
the body of the function, and FALSE otherwise. The interpretation of the
expression TEST-C(f) is then the set of all individuals for which f returns TRUE.
Only monotonic functions are considered in this respect. Q3 can then be
translated as follows:

$\mathcal{I}(C_{Q3}) = \{a \in O_\Sigma \mid \Sigma \models (PersonalName \sqcap \text{TEST-C}(rtrunc_{\text{“Andr”}}))(a)\}$

where $rtrunc_{\text{“Andr”}}$ is a test function supported by our example source, which
performs right truncation on the string “Andr”.

^6 http://www.research.att.com/sw/tools/classic/classic.html

Finally, the concept forming operators $\sqcap$, $\sqcup$, and $\neg$ (see Table 1) can be
straightforwardly used to capture the Z39.50 Boolean connectors and, or, and
and-not.

It should be stressed that when the search operators defined in a Z39.50
profile are not supported by the underlying source, we are confronted with the
same problems as in the case of unsupported APs. To cope with these problems
we follow the same approach presented in the previous section and map
unsupported Z39.50 search operators either to the true or to the false test function.
The former favors recall, since it returns all the individuals of the queried AP
concept, while the latter favors precision, since it returns the empty set. In both
cases, wrappers are able to smoothly incorporate unsupported search operators
into the query processing.



4.2 Z39.50 Query Answering 

Unfortunately, the above translation into DL is not sufficient to express the exact
semantics of Z39.50 queries as defined in a profile. We have seen in Section 2
that the result of a Z39.50 query is the set of related individuals belonging to
a central concept of interest (e.g., the root of museum objects in our cultural
scenario), rather than the set of individuals that belong to given AP derived
concepts and satisfy the search conditions. To cope with this problem we need
to define the central concept ($C_c$) in the TBox as a derived concept (e.g.
$C_c = MuseumObject$) and then introduce concept path expressions ($P_{AP}$) connecting,
through roles, the individuals of $C_c$ with the various AP concepts involved in a
query. For instance, for the AP derived concept DateOfCreation used in Q1 we
consider the following path (see Figure 2):

$P_{DateOfCreation} = \exists hasEvent.(\exists happenedIn.TimeSpan)$

Since DateOfCreation is only a simple case and AP derived concepts are usually
defined by more complex concept expressions (e.g. PersonalName), what is
really needed is to declare, for each of the involved primitive concepts (e.g. Actor,
Owner), the corresponding paths to the central concept, e.g. (see Figure 2):

$P_{PersonalName_1} = \exists hasEvent.(\exists hasActor.Actor)$

$P_{PersonalName_2} = \exists ownedBy.Owner$

The same approach is followed in order to consider the paths of composite 
APs (e.g., the 4W APs) defined in terms of others. More formally : 

Definition 2. A path expression $P_{AP}$ is a sequence of elements $p = e_1 e_2 \ldots e_n$
such that for $i \in [1, n-1]$: $e_i \in \{\exists\} \cup \{\forall\} \cup \mathcal{R}.$, where $\mathcal{R}$ is the set of primitive
role names (suffixed by “.”), and $e_n \in \mathcal{C}$, where $\mathcal{C}$ is the set of primitive concepts.

These paths are then used during Z39.50 query translation to capture the exact
answer set ($C_{Answer}$) with individuals of the central concept. More precisely,
we consider the following translation steps:

1. The core Z39.50 queries are initially translated into elementary DL query
concepts as described in the previous subsection. For instance, the preliminary
translation of Q1 presented in Section 2 is:

$\Sigma \models (PersonalName(\text{“Androutsos”}) \sqcap (DateOfCreation \sqcap \text{TEST-C}(gt_{\text{“1887”}})))$

2. Then the obtained expressions are rewritten into an intermediate canonical
form by expanding the involved AP derived concepts (TBox-view part)
into their constituent primitive ones and introducing the corresponding path
expressions ($P_{AP}$) emanating from the central concept. For instance, the
canonical form of Q1 is:

$\Sigma \models ((\exists hasEvent.(\exists hasActor.Actor(\text{“Androutsos”})) \sqcup
\exists ownedBy.Owner(\text{“Androutsos”})) \sqcap
\exists hasEvent.(\exists happenedIn.(TimeSpan \sqcap \text{TEST-C}(gt_{\text{“1887”}}))))$

3. The final expression of Z39.50 queries ($C_{Answer}$) is then obtained by
considering in the resulting canonical form only the individuals of the central
concept ($C_c$) as follows:

$C_{Answer1} = \{a \in O_\Sigma \mid \Sigma \models
(MuseumObject(a) \sqcap
((\exists hasEvent.(\exists hasActor.Actor(\text{“Androutsos”})) \sqcup
\exists ownedBy.Owner(\text{“Androutsos”})) \sqcap
\exists hasEvent.(\exists happenedIn.(TimeSpan \sqcap \text{TEST-C}(gt_{\text{“1887”}})))))\}$
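
Steps 2 and 3 above can be sketched as a small rewriting routine. This is an illustration under stated assumptions rather than the wrapper's actual code: concept expressions are nested tuples, the tables `PATHS` and `AP_PRIMITIVES` are hypothetical wrapper metadata transcribed from the paths given above, and a search condition on a primitive concept is modeled as an intersection of that concept with the condition.

```python
# Hypothetical wrapper metadata: roles leading from the central concept
# (MuseumObject) to each primitive concept, and the primitives behind each AP.
PATHS = {
    "Actor": ["hasEvent", "hasActor"],
    "Owner": ["ownedBy"],
    "TimeSpan": ["hasEvent", "happenedIn"],
}
AP_PRIMITIVES = {
    "PersonalName": ["Actor", "Owner"],
    "DateOfCreation": ["TimeSpan"],
}


def along_path(concept, filler):
    """Wrap filler in nested ("SOME", role, ...) terms, outermost role first."""
    expr = filler
    for role in reversed(PATHS[concept]):
        expr = ("SOME", role, expr)
    return expr


def canonical_term(ap, condition):
    """Step 2: expand an AP into its primitives and prefix each with its path."""
    branches = [along_path(c, ("AND", c, condition)) for c in AP_PRIMITIVES[ap]]
    expr = branches[0]
    for branch in branches[1:]:
        expr = ("OR", expr, branch)
    return expr


def answer_concept(central, terms):
    """Step 3: restrict the conjunction of all terms to the central concept."""
    expr = central
    for term in terms:
        expr = ("AND", expr, term)
    return expr


who = canonical_term("PersonalName", ("ONEOF", "Androutsos"))
when = canonical_term("DateOfCreation", ("TEST-C", "gt", "1887"))
q1_answer = answer_concept("MuseumObject", [who, when])
```

The resulting tuple nests the intersections to the left, which differs from the grouping printed in step 3 only by associativity of intersection.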



It should be noted that for Z39.50 queries using full text APs like Any, we
need to consider the paths to the central concept of all its constituent source
concepts. The resulting canonical form essentially represents a set of queries
capturing the translation of generalized path expressions at the source schema
level [19] without requiring any extension of the underlying source query
capabilities. Furthermore, as we will see in the next section, the canonical form of Z39.50
queries is subject to optimization by the wrappers, taking into account the
subsumption relationships between derived or primitive concepts.



5 Advanced Z39.50 Wrapping Services 

In Section 3 we showed the benefits of modeling Z39.50 AP mappings as DL
concepts (i.e. views) in order to formally validate their consistency. In this
section we focus on the capability of DL-based wrappers to reason about the
relationships between the AP views as well as between these views and Z39.50
queries also represented as DL concepts. Specifically, we show (a) how a flat
Z39.50 list of APs can be organized in a subsumption taxonomy, thus rendering
their underlying source-specific conceptual structure; and (b) how Z39.50 queries
can be optimized with respect to their intensional semantics without accessing
actual source data (virtual ABox).

Fig. 3. Structuring a Flat Vocabulary of Z39.50 APs

5.1 Conceptual Structuring of Flat Z39.50 Vocabularies 

Despite the simplified world view of information as a flat list of APs, Z39.50
profiles are usually developed according to an implicit conceptual structure of the
information requested by the users. Indeed, the APs defined in a profile represent
real world entities for a particular application, function, or community, at various
abstraction levels and with different relationships between them. For example, in
the CIMI-AQUARELLE profile [20,44] we can observe a wide range of APs: from
very abstract APs like Any, through general ones like What, Who, When and Where
(the 4W APs), down to more specific ones like Date or DateOfCreation. Making
their relationships explicit in the context of a specific source is very useful for both
end-users and third-party metadata providers. It essentially helps to explain
why the conceptual structures of information in a source and a profile differ, in
order to improve the design of APs, query precision, interpretation of results,
etc.

We rely on the DL Subsumption Checking reasoning service to organize in
a taxonomy the derived concepts capturing the AP mappings for a source. For
instance, given the definitions of Date and DateOfCreation (see Section 3) it
can be inferred that $DateOfCreation \preceq Date$ (see [45] for formal definitions). In
the simplest case the subsumption relationships are a direct consequence of the
definitions of composite AP concepts, as for instance the 4W APs.
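
For this simplest case, where composite APs are plain unions of other concepts, subsumption can be read off the definitions. The sketch below handles only such union definitions and uses a purely structural test that is sound for that case; it is not a general subsumption algorithm, and the helper names `primitives` and `subsumed_by` are assumptions.

```python
# Union-only AP definitions taken from Example 2; any name that is not a key
# (e.g. Actor, What, CorporateName) is treated here as an unexpanded concept.
VIEWS = {
    "PersonalName": {"Actor", "Owner"},
    "Who": {"PersonalName", "CorporateName"},
    "Any": {"Who", "What", "When", "Where"},
}


def primitives(name, views=VIEWS):
    """Unfold a union definition down to the names that have no definition."""
    if name not in views:
        return {name}
    result = set()
    for part in views[name]:
        result |= primitives(part, views)
    return result


def subsumed_by(sub, sup, views=VIEWS):
    """For unions of names: sub is subsumed by sup if every disjunct of sub
    is also a disjunct of sup."""
    return primitives(sub, views) <= primitives(sup, views)
```

Since cycles in concept definitions are not allowed, the unfolding in `primitives` always terminates.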

Figure 3 illustrates the subsumption taxonomy of several CIMI-AQUARELLE
APs as they are mapped to our example source (TBox-view part). This taxonomy
serves as advanced knowledge support about wrapped sources (i.e. metadata)
which can be exploited off-line or on-line. In the latter case the Z39.50
Explain service^7 can be used. Note that accessing and exchanging source metadata
is not a simple task due to the different technologies (DBMS, KBS, etc.)
employed by the sources and the various implementation choices made by wrapper
administrators. We believe that a DL concept language can also be used to
facilitate metadata retrieval (i.e. AP mappings) in a way commonly understood
by all clients and independent of the underlying source/wrapper technology.

5.2 Intelligent Query Processing 

In Section 4 we have seen that DL concept languages used to capture the schema
of a source and define Z39.50 AP mappings as views on top of it can also be
employed to express the Z39.50 queries against these views. Not surprisingly, Z39.50
queries can then be classified into the concept taxonomy using the subsumption
relationships between them and the other primitive or derived concepts (TBox).
The first benefit of this classification is to determine whether a Z39.50 query can be
effectively evaluated against the existing source schema and AP views. Indeed,
after the translation of Z39.50 queries into a canonical DL form, wrappers are
able to check whether the description (intension) of a query is contradictory
without accessing the source data (ABox). For instance, the following query
can be detected as inconsistent since it uses the AP ProtectionStatus mapped to
the bottom concept.

Q4: PersonalName = “Androutsos” and ProtectionStatus = “Preserved”

^7 A service allowing Z39.50 clients to retrieve information about servers.

If now a query is semantically well-defined it can be appropriately classified 
by determining the set of its immediate subsumers and subsumees, i.e. the con- 
cepts found above or below in the taxonomy. This classification opens interesting 
optimization opportunities since it induces a set of semantic transformations in 
order to locate the exact place of concepts in the taxonomy [7]. Consider, for 
instance, the following query where the derived concept Who subsumes Person- 
alName (see Figure 3) : 

Q5: PersonalName = “Androutsos” or Who = “Androutsos”

Q5 will be rewritten into the following semantically equivalent query that 
will be actually executed by the source : 

Q5’: Who = “Androutsos” 

Recall that according to the semantics of Z39.50 queries, the result is always 
composed of individuals of a central concept (Cc) like MuseumObject. Therefore 
Z39.50 queries like Q5 are always classified under Cc defined in the Tbox-view 
part. This enables an intelligent caching of query results [24,4] by the wrappers 
and a consequent optimization of Z39.50 queries. If the concept representing a 
query is found to be equivalent to one already existing in the taxonomy, the 
interpretation of that concept can be returned as an answer set instead of eval- 
uating it. This is the case of Q5 assuming that the equivalent query Q5’ has 
been previously evaluated and cached. Alternatively, the interpretations of all 
the immediate subsumers have to be checked against the query conditions. This 
is extremely useful, as Z39.50 is a stateful protocol and queries are quite often 
simple refinements of previously issued ones, for example:

Q6: Q5’ and When = 1815

In this case Q5’ subsumes Q6 and only the second part of the query needs 
to be executed by the source (intersection is performed locally by the wrapper). 
Finally, the results of Q6 could also be cached in the wrapper. This implies that 
the cached interpretation of concept Q5’ will now contain only its proper indi- 
viduals i.e. those not belonging to the interpretations of its immediate subsumees 
like Q6. Note that supporting several query answer sets proves to be quite ex- 
pensive with current implementations of Z39.50 wrappers [42,11,43] replicating 
overlapped results. 
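The caching strategy can be sketched as follows, under strong simplifying assumptions: queries are purely conjunctive, a query is subsumed by any cached query whose conditions are a proper subset of its own, and each cache entry stores its full answer set rather than only its proper individuals. The class and the run_at_source callback are hypothetical stand-ins for the wrapper and the wrapped source.

```python
class QueryCache:
    """Caches answer sets of conjunctive queries, keyed by their set of conditions."""

    def __init__(self, run_at_source):
        # run_at_source(conditions) -> set of record ids (stands in for the source)
        self.run_at_source = run_at_source
        self.answers = {}

    def evaluate(self, conditions):
        query = frozenset(conditions)
        if query in self.answers:                 # equivalent query already cached
            return self.answers[query]
        # A cached query with a proper subset of the conditions subsumes this one.
        subsumers = [q for q in self.answers if q < query]
        if subsumers:
            base = max(subsumers, key=len)        # most specific cached subsumer
            remaining = query - base              # only this part goes to the source
            result = self.answers[base] & self.run_at_source(remaining)
        else:
            result = self.run_at_source(query)
        self.answers[query] = result
        return result
```

With Q5’ cached, evaluating Q6 sends only the When condition to the source and intersects the result locally, as described above.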
Fig. 4. The Z39.50 Wrapper Toolkit Architecture



6 Implementing a DL-Based Z39.50 Wrapper Toolkit 

The architecture of the DL-based Toolkit we have developed [45] is shown in
Figure 4. It is composed of the following five modules:

Module 1 is responsible for network communication with the client and is based 
on the Yaz toolkit [28]. When it receives a search request it decodes it into
appropriate C structures. More specifically, it produces the syntax tree of the 
query that is included in the search request and sends it to Module 2. When 
a response has to be sent back to the client, this module is responsible for 
the transformation of the answer to the appropriate network format. 

Module 2 is used only during the search process. When it receives the syntax 
tree of a Z39.50 query, it translates it to a preliminary DL expression (see 
Section 4) that is sent to Module 4 for evaluation. After the execution, it 
receives the id and the cardinality of the result set (not the data themselves) 
and forwards this information to Module 1 to be sent back to the client. 

Module 3 is used only during the retrieval process. After receiving a Z39.50 
result set id it communicates with Module 5 to get the retrieved records in 
the form of C++ structures. The task of Module 3 is then to encode the 
returned C++ structures in one of the record formats defined in the Z39.50 
profile (i.e. GRS-1, USMARC or XML) in order to send the retrieved records 
back to Module 1. 

Modules 4 and 5 essentially form the DL-based wrapper for the underlying 
source (see dotted line in Figure 4). Module 4 loads the source schema and 
the AP mappings (Tbox) from a configuration file while the data reside 
in the source (virtual Abox) and can only be cached in the DL system. 
When it receives a DL query from Module 2, it rewrites it according to the 
defined AP mappings (see Section 3) and central concept of interest and 
forwards the resulting expression for evaluation to the underlying source (in 
our example, SIS). Finally, Module 5 converts the retrieved objects of the 
central concept by taking into account the mappings of the Z39.50 Record
Elements to the source data. Although not presented in this paper, these 
mappings are defined similarly to the APs. 

All modules are operational, although Module 4 currently supports only the DL
Instance Checking service and sources built on top of SIS-Telos [21]. Due to
the similarities between the DL and SIS-Telos query models, the translation of 
the resulting DL query expressions into our cultural source is straightforward. 
We plan to extend this interface of Module 4 for other data source technolo- 
gies, especially relational and object DBMSs (SQL, OQL), as already studied 
in [10,13,30]. 

To conclude, the modular architecture of the proposed toolkit significantly
reduces wrapper development and maintenance costs. First, the DL-based
Module 4 can be reused in order to wrap the same source according to multiple,
possibly overlapping profiles (e.g., AQUARELLE-CIMI and Dublin Core). This
obviates the need to merge different Z39.50 profiles into one in order to be
supported by the existing wrappers [42,11,43]. In our approach, the profile becomes
a characteristic of the client query, rather than a characteristic of the source. 
Second, the same Z39.50 server can support several wrapped sources. This is
due to the fact that Modules 1, 2 and 3 need not be aware of the Z39.50 AP
(or Element) mappings to the various source data. This information is required
only by Module 4, i.e. the source wrapper. Hence, a server can simultaneously
support sources of different technology, as well as Z39.50 profiles with different
AP mappings in each data source.



7 Conclusion and Future Work 

In this work we have addressed the declarative specification of Z39.50 wrappers. 
We have presented a wrapper generation toolkit based on DL concept languages 
in order to map the Z39.50 world view of information to the underlying source 
data structure and semantics. The proposed DL mapping language offers a number
of advantages: (i) the required views over source data can be easily defined
while a wide range of Z39.50 translation cases can be expressed (unlike standard
DBMS query languages such as SQL); (ii) it comes equipped with formally
verifiable properties that make it possible to check the consistency of the defined
views and therefore ensure the quality of the retrieved data; (iii) it enables
reasoning about the relationships between these views, thus making explicit to
Z39.50 profile developers, end-users, etc., the conceptual structure of the Z39.50 vocabularies
for a specific source; and (iv) it can serve to translate Z39.50 queries, which 
opens interesting opportunities for semantic query optimization and caching of 
results, exploiting as much as possible the stateful nature of the protocol. 

Currently, the developed toolkit supports only the DL Instance Checking service
for evaluating queries and sources built on top of SIS-Telos [21]. We plan to
complete the implementation of the toolkit in order to provide full-fledged DL
reasoning services. A study of the available DL systems (see
http://www.ida.liu.se/labs/iislab/people/patla/DL/systems.html) is ongoing for
possible integration in our toolkit. Furthermore, we intend to validate our approach
with several Z39.50 profiles and extend the wrapping facilities to other data
source technologies (e.g., DBMS, IRS, etc.). Last but not least, we plan to apply
the ideas presented in this paper at a higher level of information integration, in
order to build intelligent mediators instead of wrappers [1].
Abstract. This text motivates and defines a generic model for interactive
(online or offline) product catalogs. Based on a detailed requirements
analysis, the data model is defined using an object-oriented design no- 
tation and the query language for expressing customer interests on the 
catalog is defined using techniques from fuzzy set theory. The model 
provides the basis for the implementation of a generic, highly-interactive 
catalog management system which is designed to be interfaced with rela- 
tional databases, information-retrieval engines and special-purpose index 
structures. 



1 Introduction and Motivation 

An ever increasing number of enterprises is using digital media like online web 
servers or offline CD-ROMs as part of their communication with (potential) 
customers. Interactive digital communication media have the potential to close 
the gap between two extreme communication modes. One extreme is the direct, 
one-to-one conversation between an individual customer and a well-trained sales 
person. The other extreme is the mass distribution of anonymous information 
to the customer via broadcast media like advertisements, product fact sheets or 
product catalogs. In this paper, we present a generic data, language and system 
model for interactive (online or offline) product catalogs that act as active as- 
sistants to help customers find items (products, services, information resources) 
matching their very personal interests and preferences. In particular, the system 
is capable of establishing and sustaining a long-term customer relationship which 
may lead to a valuable learning process involving customers, catalog maintainers
and information providers. 

We chose to name this model and system PIA (personal information assis- 
tant) emphasizing the active role of the system, driving a conversation from 
an initial (fuzzy) customer request towards a detailed mutual understanding of 
the (mis-)match between the customer’s personal preferences and the available 
corporate services and products. This should be seen in contrast to a more traditional
data-centered view of such product catalogs as index structures or search
engines. 

S. Abiteboul, A.-M. Vercoustre (Eds.): ECDL’99, LNCS 1696, pp. 403-422, 1999 
© Springer-Verlag Berlin Heidelberg 1999
As a consequence, our work focuses on those objects and algorithms that are
visible from the catalog users’ perspective (customers, maintainers and providers) 
and that are relevant for the long-term cooperation between these users taking 
into account the inevitable evolution of customer interests and of the catalog 
contents and organization. At the present stage of our work, we are less inter- 
ested in identifying a restricted query language with strict algebraic properties 
which can be exploited for clever query optimization, but we prefer to provide 
customers with a rich (but formally defined) language of predicates and con- 
nectives for the interaction with the catalog which can also help the catalog 
maintainer and the information provider to learn from these customer needs. 

In this paper, we abstract from the specific terminology used in particular 
application domains (business, trade, library management, knowledge manage- 
ment, etc.) and focus on the strong commonalities in the information structures 
and interaction patterns with catalogs in these domains. We therefore talk gener- 
ically of a catalog of items with properties. The catalog items do not need to be 
physical objects described by numeric or discrete values, they can as well be 
service descriptions (trips offered by a travel agency, financial and insurance 
services, ...) or descriptions of information resources (bibliographic data, news 
articles, image descriptors, software components, ...). Each property of an item is 
defined by a value for an attribute. An attribute is associated with a domain which 
can be a numeric, a discrete, a full-text or a special-purpose domain defined by 
the catalog maintainer. A customer expresses his/her interest interactively as 
alternatives, each of which combines several criteria. Each criterion is expressed 
by a parameterized predicate on a single attribute. 

Our model is designed to be applicable to a rather homogeneous product 
catalog of a single producer (e.g., all cars produced by BMW, all books published 
by Springer) but also to a heterogeneous catalog integrating diverse products of 
many providers with no common property schema (e.g., all products at Harrods 
in London, all documents in a library). 

An important assumption of our work is the fact that PIA only provides de- 
cision support for a customer. Therefore, we do not have to take into account the 
problem of decision making based on fuzzy requirements and partial information. 
The main contributions of our work presented in this paper are 

— the conceptualization and formalization of the generic objects and cooper- 
ative processes involved (like interest, alternative, criterion, attribute, do- 
main, score, offer, decision, notification etc.), as described in Chapter 2 and 
in Chapter 3, 

— the conceptualization and formalization of generalized fuzzy query operators
in a query language which permits combined similarity queries ranging over
numeric, discrete and full-text domains and also supporting semi-structured
data objects (see Chapter 4),

— the identification of supporting (commercially available) technologies and 
their consistent integration in an overall system architecture with well-defined 
component interfaces, demonstrated by a prototypical PIA implementation 
with an elaborate interactive front-end (see Chapter 5). 
2 User Groups of Personal Information Assistants and
their Requirements

As explained in the introduction, we can distinguish three classes of cooperating 
actors interacting with PIA: Customers approach the system to easily find prod- 
ucts best suited for their interests. Providers employ the system as a show-case 
or electronic marketplace, publishing information on the products they offer. 
Maintainers act as mediators, keeping the catalog in a form that supports a 
long-term provider-customer relationship. 

The goals of a customer are

— to gain an understanding of the items in the catalog and their properties 
by browsing, inspecting individual items, asking for representative samples, 
viewing summaries or ranked lists according to his/her personal criteria, 

— to refine incrementally his/her personal interests from initial fuzzy qualita- 
tive requirements to more detailed quantitative criteria possibly structured 
into decision alternatives, 

— to obtain specific offers from the provider taking into account his/her per- 
sonal preferences, 

— to decide on follow-up actions involving the matching items (buy a product, 
request a service, make use of a piece of information, enter a negotiation 
process using a different communication medium, request notifications on new
similar items becoming available in the future), 

— to be able to return at a later point in time and to easily resume a previous 
interaction with the assistant and to be recognized as an individual customer
with a particular past conversation history. 

The goals of a provider are 

— to clearly arrange all items available, being able to capture and emphasize 
their diversity and not being forced to fit them into a rigid common property 
schema, 

— to provide multiple access paths to the same item in order to give the cus- 
tomer maximum freedom in his/her decision process, 

— to learn from and to recall the personal interests of individual customers, 

— to identify and to quantify larger customer groups with shared preferences 
or interests, 

— to learn from the reactions of customers to offers made. 

The goals of a catalog maintainer are 

— to integrate information from multiple providers, 

— to detect and to resolve mismatches in attribute names, measuring units,
or in the terminology used by multiple providers, 

— to accommodate the evolution of domains, attributes and properties over time,
with minimal disruption of existing customer relationships, 

— to develop and enforce a consistent terminology for a smooth communication 
between customers and providers. 
3 Design of a Generic Object Model for Personal
Information Assistants 

Our model makes use of two complementary formalisms: First, the central se- 
mantic concepts like customer interest, criterion, item or property are represented 
as objects in an object-oriented data model. They can be mapped to values in 
(standard relational) databases, copied as structured values across the network 
and be associated with each other. We use the unified modeling language UML 
[1] and the standardized base types of Java to formally describe the static se- 
mantics of the PIA model. The main PIA concepts are summarized in Fig. 1 
and will be explained successively within this chapter. 

The dynamic semantics of the model relies heavily on fuzzy set theory [2, 
3], as a second formalism employed in our work. For example, an interest i is 
formalized as a matching method i.score() which assigns a score (a real number 
from the interval [0,1]) to a given item. A score of 1 expresses a perfect match 
while a score of 0 is assigned to items that are irrelevant w.r.t. this interest.

3.1 Modeling the Catalog Users 

The top of Figure 1 is a description of the three classes of actors cooperating with 
a PIA. Customer, Maintainer and Provider are subclasses of a common superclass
Actor. A standard login procedure can be used to authorize and to identify
maintainers and providers. However, immediate identification of customers may
not be desirable and can be delayed until a later conversation phase, for example,
when a subscription request is issued by a customer with a personal e-mail
address. At the beginning of an anonymous interactive session, a new customer
profile is created that can be merged with another customer profile as soon
as the identity of the customer can be determined (method mergeWith of class
Customer).

3.2 Modeling the Catalog Contents 

A PIA catalog is a collection of items (see classes on the right of Fig. 1). For 
communication purposes between customer, provider and maintainer, each item 
has to be assigned a printable unique catalogID. In practice, the choice of a 
naming scheme for catalog items is very important for the long-term evolution of 
the catalog (e.g., to identify successive versions of a product), but in the examples
of this paper, simple integer numbers will be used. 

An item in the catalog does not need to capture all properties of an item but 
there may be further information on the item itself available elsewhere. In PIA, 
an attribute details is provided by objects of class Item which can hold the URL
of further information sources. This URL can also reference the item itself in
case it is a purely digital artifact (a document, a piece of software, or a digital
image).

In PIA, each item is described by a set of properties. A property is a pair 
consisting of an attribute and a value, for example: 
(1345, { (product name, BMW 328i), (product category, sports car),
(price, (30000, USD)), (average fuel consumption, (11 l/100 km)),
(maximum speed, (230 km/h)), (SRS, present),
(GPS, optional), ...})

Fig. 1. Overview of the PIA Object Model

Each attribute is identified by its name (e.g., SRS) and may have an optional 
link (URL) to a detailed description of the semantics of the attribute. For exam- 
ple, it should be explained what a SRS is or what measuring technique is used 
to determine the fuel consumption of a car, etc. Such descriptions play a central 
role in conversations between customers and sales persons. They contribute sig- 
nificantly to the customer’s learning process and help her/him to get acquainted 
with the interactive catalog. 

Each attribute is based on a domain that defines the universe of values avail- 
able for the attribute and a set of predicates defined on values of the universe. 
Multiple attributes may share the same domain (e.g., average fuel consumption 
and fuel consumption for city traffic are both based on the domain
VolumePerLength).

In our model, the product category of an item is defined as a regular property 
of the item. In particular, there is no typing or schema enforced on items of a 
certain product category. As a consequence, an information provider is free to 
attach arbitrary properties to an item or to omit properties for attributes where 
the values are unknown, unspecified, or simply not favorable. 
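The schema-free item representation can be sketched as a plain mapping from attribute names to values. The Python class below is only an illustration under that assumption; the field names and the second item (catalog ID 2001) are hypothetical and not taken from the PIA implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    catalog_id: int                                   # printable unique identifier
    properties: dict = field(default_factory=dict)   # attribute name -> value
    details: str = ""                                 # optional URL of further information


car = Item(1345, {
    "product name": "BMW 328i",
    "product category": "sports car",   # an ordinary property, not a type
    "price": (30000, "USD"),
    "SRS": "present",
})

# No schema is enforced per category: a provider may add or omit attributes freely.
book = Item(2001, {"product category": "book", "title": "Example title"})
```

Because the product category is just another key, nothing forces the two items to share a common set of attributes.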



3.3 Modeling Domains and Values

Despite the lack of schema constraints for product attributes, PIA enforces a
large number of constraints on domains and values which are essential for the
consistency and usability of the catalog. Domains and values exist independently of
concrete items, they define possible properties and they constitute the common 
language for the communication between customer and provider. The mainte- 
nance of domains and their associated objects (measuring units, discrete values, 
predicates, similarity functions etc.; see bottom of Fig. 1) is the task of the 
catalog maintainer. 

Each domain has a unique name in the catalog and provides a newValue 
method which is similar to a constructor in object-oriented programming lan- 
guages. The method returns a value of the domain identified by a textual de- 
scription passed as an argument to the method. Such a value-identifying string 
is called a literal in programming language terminology and is required in PIA to 
convert textual descriptions submitted by customers, maintainers and providers 
into internal value representations supporting efficient storage and retrieval. For 
example, the domain fuel consumption defines a method newValue which given 
the string “12.3 l/100 km” returns a value that represents a fuel consumption
of 12.3 liters per 100 kilometers. 
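A minimal sketch of such a newValue method, assuming values of this domain are stored as a floating point number of liters per 100 km. The class name, the accepted literal syntax and the error handling are assumptions made for illustration only.

```python
import re


class FuelConsumptionDomain:
    """Hypothetical domain whose values are stored as liters per 100 km."""

    name = "fuel consumption"
    _literal = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*l\s*/\s*100\s*km\s*$")

    def new_value(self, literal):
        """Convert a textual literal such as '12.3 l/100 km' into a float."""
        match = self._literal.match(literal)
        if match is None:
            raise ValueError(f"not a {self.name} literal: {literal!r}")
        return float(match.group(1))
```

Rejecting malformed literals at this point keeps invalid values out of the internal representation used for storage and retrieval.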

PIA distinguishes three built-in classes of domains with distinct value repre- 
sentation. A concrete catalog assistant built with PIA contains a large number 
of domains each of which belongs to one of these three classes and has been
defined and customized interactively by the catalog maintainer. 

Numeric Domain A Numeric Domain describes numeric values represented 
internally as floating point numbers. A numeric domain may impose a (fixed) 
upper or lower bound on the admissible values. A numeric domain d can 
either describe ordinal numbers (d.isOrdinal, like the number of seats in
a car) or it contains values that describe quantitative properties of items 
(like their length or their speed) which are expressed using a measuring 
unit (centimeter, inch, hour, second). Therefore, each numeric domain is 
associated with exactly one measuring unit (e.g. miles per hour). This unit 
can be regarded as a kind of reference unit for all other measuring units that 
measure the same physical phenomenon (e.g. meters per second or kilometers 
per hour). Values for a numeric domain expressed in these derived measuring
units are converted by PIA via a sequence of offset and factor calculations 
involving intermediate measuring units to the desired reference unit. This 
modeling is motivated by the requirement to allow customers and providers 
to work with their preferred measuring units and to also provide active 
assistance during interactive user input. 

Discrete Domain A Discrete Domain consists of a finite number of discrete 
values represented by literals (or by agreed-upon iconic visual representations),
for example:

TV Norm = {PAL, Secam, NTSC} 

In many cases, discrete values used by providers and customers describe 
overlapping concepts and over time the set of discrete values evolves, includ- 
ing specialization and generalization of individual concepts. For the purpose 
of catalogs, it is therefore highly desirable to be able to arrange the dis- 
crete values in a single rooted tree such that the children of a discrete value 
specialize the concept of their parent. 

TV Norm = (Any TV Norm, { 

(PAL, { PAL B/G, PAL I, PAL L} ), 

(Secam, {Secam B/G, Secam D/K, Secam L}), 

(NTSC, 

The root of the hierarchy and the other inner nodes of the tree can be used 
by customers and providers as property values to denote approximations 
of existing values or discrete values not (yet) added to the domain by the 
catalog maintainer (e.g. after the release of a new PAL TV norm). This is 
again an example for a learning process where the catalog maintainer de- 
pends crucially on input (in this case notifications about exceptional values) 
from customers and providers to decide how to improve the catalog.

Full-Text Domain A Full-Text Domain consists of full-text values with no
further restriction on the contents of these strings. A catalog may feature
distinct full-text domains to distinguish string values expressed in different
languages (e.g. in English or in German). Strings and their domains are
modeled as subclasses FullTextValue and FullTextDomain of the PIA classes
Value and Domain, respectively (compare Fig. 1). 

There is a tradeoff between the use of full-text domains and the use of discrete 
domains since the maintenance of a set of discrete values puts a burden on 
the catalog maintainer and the information providers who have to agree on 
this common vocabulary. On the other hand, customers are eager to gain a 
deeper understanding of the search space which is better supported by the 
organized structure provided by discrete domains. A run-time conversion 
(evolution) of a catalog domain from a full-text to a discrete domain and 
vice versa should therefore be supported. 
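Two mechanisms from the numeric and discrete domains above can be sketched briefly: conversion to a reference unit through a chain of factor and offset steps, and the specialization test on a tree of discrete values. The unit table, the choice of m/s as reference unit and all names are assumptions; the TV norm tree lists only the values that are readable in the example above, with NTSC shown without children.

```python
# Each derived unit is defined relative to another unit:
#   value_in_base = value * factor + offset
# The reference unit has no entry. Table contents are illustrative.
UNITS = {
    "km/h": ("m/s", 1000.0 / 3600.0, 0.0),
    "mph": ("km/h", 1.609344, 0.0),
}
REFERENCE_UNIT = "m/s"


def to_reference(value, unit):
    """Follow the chain of intermediate units until the reference unit is reached."""
    while unit != REFERENCE_UNIT:
        base, factor, offset = UNITS[unit]   # KeyError for an unknown unit
        value = value * factor + offset
        unit = base
    return value


# Discrete domain: child value -> parent value; the root generalizes all values.
TV_NORM_PARENT = {
    "PAL": "Any TV Norm", "Secam": "Any TV Norm", "NTSC": "Any TV Norm",
    "PAL B/G": "PAL", "PAL I": "PAL", "PAL L": "PAL",
    "Secam B/G": "Secam", "Secam D/K": "Secam", "Secam L": "Secam",
}


def specializes(value, ancestor):
    """True if `value` equals `ancestor` or is one of its descendants in the tree."""
    while value is not None:
        if value == ancestor:
            return True
        value = TV_NORM_PARENT.get(value)
    return False
```

A value given in mph is converted in two steps (mph to km/h, then km/h to m/s), which mirrors the use of intermediate measuring units; an inner node such as PAL can stand in for a PAL variant not yet added to the domain.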

For specific product catalogs, it may also be necessary to define additional 
refined domains, for example, to describe the domain of person name. By
appropriate subclassing of the classes FullTextDomain, FullTextValue and
FullTextPredicate, a catalog builder can choose a canonical internal representation
for person names and can define specific predicates on person names (is similar
to, sounds like, has family name, has first name, has low edit distance).
Other examples of special-purpose domains are dates of the Gregorian calendar
or longitude/latitude coordinates on the globe.

The concept of domains could also be extended to support set-valued do- 
mains (colorset is a set of color) and aggregate domains (address is an aggregate 
of street, number, and address) by an appropriate subclassing of the classes 
Domain and Value to introduce set and record type and value constructors,
respectively. This extension would lead to an orthogonal type system where type
constructors can be nested to arbitrary depth (e.g. to define sets of records with 
set attributes). However, this generalization would add significant complexity to 
the model and its implementation. 

3.4 Modeling Customer Interests 

A customer may have several unrelated interests (e.g., buy a book, find a present 
for a friend) pertaining to the same catalog. Therefore, PIA maintains a set of 
interests per customer which are distinguished by name (see classes on the left 
of Fig. 1). 

As a first cut, each interest can be described by a (non-empty) set of criteria, 
for example: 

{(product category, is, convertible car, very important),
(price, is not much higher than, (15000, USD), normal),
(maximum speed, is more than, (160 km/h), normal),
(fuel consumption, is not too high, (), normal),
(color, not(is), (yellow), normal),
(number of seats, is between, (2, 4), normal),
(SRS, is, (present), normal),
(GPS, is, (optional), not so important) }
In our model, each criterion is expressed by a predicate on an attribute of the
desired item(s) with an associated weight. Each attribute and predicate specified 
in a criterion has to be defined previously by the maintainer of the catalog (see 
Sec. 3.3 and Chapter 4). Each predicate can be negated expressing the fact that 
the customer is interested (most) in those items that do not satisfy the predicate. 

The weight of a criterion is intended to capture the relative importance of 
a criterion w.r.t. the other criteria contributing to the interest. Technically
speaking, a weight is a numeric value in the interval [0,1]. In order to simplify the
interaction with customers, weights are denoted by literals chosen from a fixed
(small) set of names (e.g., very important, important, normal, less important,
not so important or simply essential, nice to have).

Before the exact semantics and properties of predicates, criteria and of the 
dynamic combination of criteria are defined in Chapter 4, we have to further 
refine the notion of a customer interest. 

A simple “conjunctive” combination of (negated) criteria is sufficient in the
early stages of the customer’s decision making process where he/she is exploring 
the space of items available and he/she is zooming in or out of the space by 
interactively adding or removing criteria or modifying predicate parameters. 
However, as a result of this refined understanding, customers tend to formulate 
alternatives each of which is in turn described by a combination of (negated) 
criteria. The decision which alternative is to be pursued further is made by 
a direct comparison of the items matching these alternatives. For example, a
customer could formulate the alternative to either buy a light-weight laptop with 
a long battery lifetime or a fast desktop PC with a 3D graphics accelerator. 

In PIA, a customer interest can therefore be described as a (non-empty) set 
of alternatives each consisting of a set of (negated) criteria. An item can match 
the customer’s interest if it matches one (or several) of the alternatives. This 
“disjunctive” combination of alternatives and the semantics of weights assigned
to individual alternatives are described in Chapter 4. 

Especially when searching catalogs with large amounts of items, a customer 
may want to restrict the number of potentially suitable items returned by the 
system. Thus, the information assistant provides the possibility to equip a cus- 
tomer interest either with a cutoff score, returning only those items for which 
the evaluation results in a score higher than the cutoff score, or a cutoff size, 
returning no more than the specified number of items. 

A possible generalization of our model is to add support for more customer- 
centered cumulative criteria, like the ones used in the following statement of 
interest: 

{(product category, is, car, normal),
(performance, is very high, normal),
(comfort, is not too low, normal),
(lifetime, is high, important)}

These cumulative and qualitative criteria would be based on predicates on 
numeric domains defined by the catalog maintainer. The corresponding attribute 
values of items in the catalog could either be provided explicitly for each item
or be derived via heuristic functions from other attribute values of the same
item. Formal techniques for relating customer-centered qualitative criteria and
product-centered quantitative criteria have been studied extensively since the
mid-eighties in engineering sciences (quality function deployment, house of
quality diagram) [4].

A distinguished property of our model is the fact that the product category 
is captured as an “ordinary” attribute on a discrete domain as described in Sec.
3.3. The intended positive effect of this uniformity is the fact that a customer 
can combine fuzzy predicates on the product category attribute with other pred- 
icates using weights, negation, conjunctive and disjunctive aggregation without 
being constrained by static schema information. Moreover, this uniformity makes 
it easy to implement catalog brokers that integrate the contents from several au- 
tonomous catalogs. 



3.5 Modeling Mutual Understanding between Customer and 
Provider 

The evaluation of a customer interest, as sketched in the previous section and 
detailed in the next chapter, leads to a set of matches which serve as a founda- 
tion for the establishment of a mutual understanding between the information 
provider and the customer (see classes in the center of Fig. 1). Each match
integrates an item and a score resulting from the scoring method of the interest
applied to this item. 

The evaluation may span one or more catalogs and produces a number of 
matches corresponding to the number of evaluated items or limited by some 
cutoff number or cutoff score defined by the user. The sequence of matches 
ordered by decreasing score is called an offer. A PIA offer may not only be
displayed in the user interface, but it may also be saved by the customer for 
subsequent examination, or it may be re-created automatically after fixed time 
intervals as the result of a customer’s subscription. It is also possible to compute 
cumulative figures (frequency, min, max, median, ..) over an offer or to perform 
statistical cluster analysis on the offer. 
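
The construction of an offer from a set of matches can be sketched as follows. This is only a minimal illustration under our own assumptions: the class OfferSketch, the method buildOffer and the representation of an item by its name are hypothetical and not part of the PIA implementation. Both the cutoff score and the cutoff number are applied.

```java
import java.util.ArrayList;
import java.util.List;

final class OfferSketch {
    /** A match pairs an item (here reduced to a name) with its score in [0,1]. */
    static final class Match {
        final String item;
        final double score;
        Match(String item, double score) { this.item = item; this.score = score; }
    }

    /**
     * Keeps the matches whose score reaches cutoffScore, orders them by
     * decreasing score and truncates the result to at most cutoffSize entries.
     */
    static List<Match> buildOffer(List<Match> matches, double cutoffScore, int cutoffSize) {
        List<Match> offer = new ArrayList<>();
        for (Match m : matches) {
            if (m.score >= cutoffScore) offer.add(m);
        }
        // descending order of score; List.sort is stable for equal scores
        offer.sort((a, b) -> Double.compare(b.score, a.score));
        if (offer.size() > cutoffSize) {
            return new ArrayList<>(offer.subList(0, cutoffSize));
        }
        return offer;
    }

    public static void main(String[] args) {
        List<Match> matches = new ArrayList<>();
        matches.add(new Match("tv-a", 0.40));
        matches.add(new Match("tv-b", 0.95));
        matches.add(new Match("tv-c", 0.10));
        matches.add(new Match("tv-d", 0.70));
        for (Match m : buildOffer(matches, 0.3, 2)) {
            System.out.println(m.item + " " + m.score);
        }
    }
}
```

With a cutoff score of 0.3 and a cutoff number of 2, the example keeps the two best-scoring items, tv-b and tv-d, in that order.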

Examining an offer, the customer selects one or several items following her/his 
personal (subjective) rating. A decision with regard to a match can be observed 
only if the customer takes some action. As we abstract from any follow-up actions 
outside the field of decision support (e.g. putting a product in a shopping cart), 
we define a decision to be taken if the customer explicitly marks the respective 
match as valuable. 

Decisions are administered by the information assistant, because they may 
provide useful hints on the personal understanding that a customer may have of 
an item and on her/his habits in formulating interests. The decisions are, therefore, 
recursively connected with their preceding interest, where they can be used for 
supporting relevance feedback functionality [5]. 
4 Fuzzy Matching between Customer Interests and 
Catalog Items 



In Section 3.4 and in the left-hand part of the UML diagram of Fig. 1 we defined 
the (abstract) syntax for the customer interests (=fuzzy queries) by typed trees. 
In this section, we inductively define the semantics of customer interests by a 
query evaluation function which computes a numeric score for a given query tree 
(i.e., an object of class Interest) and for a given catalog item based on the item’s 
properties (its attribute / value bindings). 

In our object-oriented model, the score function is decomposed into (a hi- 
erarchy of) score methods defined in the three classes of the query tree nodes. 
The score for a given Item i and a given Interest x is thus defined as the value 
x.score(i) which is computed bottom-up from the scores of all predicates in the 
tree applied to i taking fuzzy negations, fuzzy and weighted conjunctions and 
finally fuzzy and weighted disjunctions into account. 

The evaluation of fuzzy predicates involving negation, disjunction and conjunction 
is defined in Section 4.3. The underlying definition and evaluation of 
simple predicates, i.e. predicates involving only a single attribute of an item, is 
explained in Section 4.2. Together, these sections define the generic semantics 
of an implementation of the PIA query engine, which is parameterized by the 
particular choice of the basic domains and their predicates (technically defined 
as class files dynamically linked to the PIA kernel). 



4.1 Capturing Similarity through Numeric Scores 

A severe limitation of boolean predicates underlying classical (relational and 
object-oriented) database models is the difficulty of adequately modeling fuzzy 
customer interests. For example, a user interested in an item with a price of 
less than 100$ would certainly be willing to pay 101$ for an item that perfectly 
matches all other criteria imposed on the item. It is therefore common practice 
in information retrieval and multimedia databases to use numeric scores in the 
interval [0,1] to model user interests [6,5,7]. 

However, different research communities have associated different (partially 
incompatible) interpretations with the values returned from such score functions, 
such as the fuzzy set interpretation [2,8], the spatial interpretation originally 
used in text databases, the metric interpretation [9], or the probabilistic 
interpretation underlying advanced information retrieval systems [10]. 

Our goal in the design of the PIA model and system was to allow maximum 
freedom in the formulation and combination of predicates while still preserving the 
minimum semantic consensus necessary to build a meaningful user interface, an 
efficient query evaluator, user profile manager, persistence manager etc. Therefore, 
we constrained the system to a boolean combination of atomic fuzzy queries. 
More specifically, we allow n-ary (weighted) disjunctions of n-ary (weighted) 
conjunctions, which leads to a model that is very similar to the work of [7]. 
4.2 Definition and Evaluation of Simple Fuzzy Predicates on 
Numeric, Discrete and Full-Text Domains 

This section describes the semantics of simple predicates which are used by 
customers as building blocks in expressions to articulate their personal interests, 
as explained in Section 3.4. 

If a customer defines a criterion on an attribute a with domain d, he/she can 
only make use of predicates defined on d by the catalog maintainer. For example, 
it is not allowed to evaluate the numeric predicate is greater than on an attribute 
a that takes values of a discrete domain like TV Norm which does not define 
an order on its elements. This constraint is already enforced by the PIA front 
end through simple typing rules. As a consequence of this domain restriction, 
one can distinguish three kinds of predicates, called numeric, discrete and full-text 
predicates. Their superclass Predicate captures the common properties of 
all predicates (see Fig. 1). 

For example, if Length is a numeric domain, a boolean equality predicate can 
be defined for that domain as follows using a Java-like syntax for the function 
definition: 



(is exactly, 2, (Length, Length), 
 float score(Item i, Attribute a, Value[] args) { 
   Value val = i.valueOfAttribute(a); 
   if (val == null) return 0.0;  // attribute a not defined for item i 
   if (val.numericValue == args[0].numericValue) return 1.0;  // exact match 
   return 0.0; } ) 

The predicate is exactly is defined on the domain Length and requires a second 
argument of domain Length. The function returns the value 0 if the item i does 
not have an attribute a or if the (numeric) attribute value does not exactly match 
the numeric value specified as an argument. Otherwise, the value 1 is 
returned to indicate an exact match. This example also illustrates that standard 
boolean predicates (=, <, >, ...) can be captured in our model by restricting the 
set of admissible score values returned to elements from the binary set {0,1}. 

As described in Section 3.4, a customer could then use this (parameterized) 
predicate to express his/her interest in items that have an exact width of 10 cm 
and an exact height of 15 cm (assuming that the attributes width and height are 
both based on the domain Length): 

{(width, is exactly, (10 cm), normal), 
 (height, is exactly, (15 cm), normal)} 

This way, a predicate can be applied to different attributes and it can also 
verify whether an item possesses the indicated attribute at all. 
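
In contrast to the boolean predicate is exactly, a genuinely fuzzy predicate returns intermediate scores. The following sketch shows how a predicate such as is approximately might be scored. The triangular shape and the tolerance parameter are assumptions made for illustration only; the paper does not fix the membership function, and the names used here are hypothetical.

```java
final class ApproximatelySketch {
    /**
     * Triangular membership function: returns 1.0 if value equals target and
     * falls linearly to 0.0 at a distance of tolerance (assumed to be > 0).
     * A null value models an attribute that is not defined for the item.
     */
    static double isApproximately(Double value, double target, double tolerance) {
        if (value == null) return 0.0;
        double distance = Math.abs(value - target);
        return Math.max(0.0, 1.0 - distance / tolerance);
    }

    public static void main(String[] args) {
        // width in cm, target 10 cm, tolerance 2 cm (all values hypothetical)
        System.out.println(isApproximately(10.0, 10.0, 2.0));
        System.out.println(isApproximately(11.0, 10.0, 2.0));
        System.out.println(isApproximately(13.0, 10.0, 2.0));
        System.out.println(isApproximately(null, 10.0, 2.0));
    }
}
```

Under these assumptions an item that misses the target by half the tolerance still receives a score of 0.5 instead of being rejected outright.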

In the following, we illustrate the expressive power and uniformity of the 
PIA model by giving examples of predicates on numeric, discrete and full-text 
domains. We start with a list of generic convex fuzzy predicates on numeric 
domains: 

Predicate Name                              Vector of Argument Domains 

is as large as possible; is (very) large; 
is average; is as small as possible; 
is (very) small                             (NumericDomain) 

is exactly; is approximately; 
is (strictly) larger than; 
is (strictly) smaller than                  (NumericDomain, NumericDomain) 

is exactly between; is between              (NumericDomain, NumericDomain, 
                                             NumericDomain) 

On a discrete domain consisting of a single-rooted hierarchy of values the 
following predicates can be defined generically: 



Predicate Name                              Vector of Argument Domains 

is an exact value; is vague                 (DiscreteDomain) 

is exactly; is subsumed by; 
subsumes; is similar to                     (DiscreteDomain, DiscreteDomain) 

is one of; subsumes one of; 
is subsumed by one of                       (DiscreteDomain, DiscreteDomain[]) 



In order to define the exact semantics of these predicates we prefer to use 
the model of fuzzy relations and fuzzy partitions (as analogues of mathematical 
relations and partitions derived from mathematical sets) [11]. 
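
For the crisp special case, the predicate is subsumed by on a single-rooted value hierarchy can be sketched as below. The sketch is an illustration under our own assumptions: the hierarchy is stored as a child-to-parent map, every value is taken to subsume itself, and the category names are hypothetical. It does not cover the fuzzy relations of [11].

```java
import java.util.HashMap;
import java.util.Map;

final class DiscreteDomainSketch {
    /** Maps each value to its parent; the root maps to null. */
    private final Map<String, String> parentOf = new HashMap<>();

    void addValue(String value, String parent) {
        parentOf.put(value, parent);
    }

    /**
     * Returns 1.0 if ancestor is reached by walking from value towards the
     * root (a value subsumes itself), and 0.0 otherwise. The hierarchy is
     * assumed to be acyclic.
     */
    double isSubsumedBy(String value, String ancestor) {
        for (String v = value; v != null; v = parentOf.get(v)) {
            if (v.equals(ancestor)) return 1.0;
        }
        return 0.0;
    }

    public static void main(String[] args) {
        DiscreteDomainSketch category = new DiscreteDomainSketch();
        category.addValue("Product", null);            // single root
        category.addValue("Electronics", "Product");
        category.addValue("TV", "Electronics");
        category.addValue("Furniture", "Product");
        System.out.println(category.isSubsumedBy("TV", "Electronics"));
        System.out.println(category.isSubsumedBy("TV", "Furniture"));
    }
}
```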

Modern information retrieval engines provide efficient index support to return 
scores for the following predicates on full-text domains: 



Predicate Name                              Vector of Argument Domains 

is empty                                    (Full-Text Domain) 

contains the word; contains a word 
starting with; contains the stemmed word    (Full-Text Domain, Full-Text Domain) 

contains x near y; contains x followed 
by y not more than k words apart            (Full-Text Domain, Full-Text Domain, 
                                             NumericDomain) 

contains a passage of k words 
which contain the words                     (Full-Text Domain, NumericDomain, 
                                             Full-Text Domain[]) 
4.3 Definition and Evaluation of Complex Fuzzy Predicates 

As already explained in the introduction of this chapter, the score function defining 
the semantics of a query is decomposed into (a hierarchy of) score methods 
defined in the three classes of the query tree nodes. The score for a given Item 
i and a given Interest x is thus defined as the value x.score(i), which is computed 
bottom-up from the scores of all predicates in the tree applied to i, taking 
fuzzy negations, fuzzy and weighted conjunctions and finally fuzzy and weighted 
disjunctions into account. 

A criterion describes the (possibly negated) application of a predicate to 
an attribute. A unary predicate (e.g. isEmpty()) does not require any further 
arguments, while most predicates (e.g. isApproximately(a)) can be parameterized 
by the customer with additional values. 

class Criterion { 
  Predicate predicate; Attribute attribute; Value[] arguments; 
  float weight; boolean negated; 
  float score(Item i) { 
    // a negated criterion returns the fuzzy complement 1 - s 
    if (negated) { return 1.0f - predicate.score(i, attribute, arguments); } 
    else { return predicate.score(i, attribute, arguments); } } 
} 



The score method for a criterion is computed by evaluating the score method 
of the predicate with the given arguments on the given attribute. For a given 
score s a negated criterion returns the score 1-s. 

A single alternative describes a weighted conjunction of criteria. In the literature 
one can find a large number of different aggregation functions that compute 
a score for a weighted "conjunction" of scores [12,7]. Some of these aggregation 
operators are sensitive to the order of their subterms [13], so we chose to 
represent the subterms by a vector and not by a set. 

class Alternative { 
  Criterion[] criteria; AndLikeAggregationOperator op; float weight; 
  float score(Item i) { 
    float[] cs = new float[criteria.length];  // scores of the criteria 
    float[] cw = new float[criteria.length];  // weights of the criteria 
    for (int j = 0; j < criteria.length; j++) { 
      Criterion c = criteria[j]; cs[j] = c.score(i); cw[j] = c.weight; 
    } 
    return op(i, cs, cw); } 
} 



The score method for an alternative is computed by evaluating the score 
method of the criteria and then passing the results together with the weights 
of the subterms to the AndLikeAggregationOperator op. [7] describes desirable 
algebraic properties of weighted aggregation operators. 
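
To make the role of an and-like aggregation operator concrete, the following sketch implements one possible choice, a weighted minimum that combines each score s with its weight w as max(1 - w, s). This operator is our own illustrative assumption and not necessarily the one used in PIA or proposed in [7]; weights are assumed to lie in [0,1], and the item argument of op is omitted for brevity.

```java
final class WeightedMinSketch {
    /**
     * Weighted minimum: the minimum over all j of max(1 - weights[j], scores[j]).
     * A weight of 1 lets the score count fully, a weight of 0 makes the
     * criterion irrelevant, and an empty conjunction scores 1.0.
     */
    static double weightedMin(double[] scores, double[] weights) {
        if (scores.length != weights.length) {
            throw new IllegalArgumentException("scores and weights differ in length");
        }
        double result = 1.0;
        for (int j = 0; j < scores.length; j++) {
            result = Math.min(result, Math.max(1.0 - weights[j], scores[j]));
        }
        return result;
    }

    public static void main(String[] args) {
        double[] scores = {0.8, 0.3};
        System.out.println(weightedMin(scores, new double[] {1.0, 1.0}));
        System.out.println(weightedMin(scores, new double[] {1.0, 0.5}));
        System.out.println(weightedMin(scores, new double[] {1.0, 0.0}));
    }
}
```

With all weights equal to 1 the operator reduces to the ordinary fuzzy minimum; lowering the weight of the second criterion raises the floor below which its poor score can no longer pull down the result.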
Finally, a customer interest is described as a weighted disjunction of alternatives. 
The score method for an interest is computed analogously by evaluating 
the score method of the alternatives and then passing the results together with 
the weights of the subterms to the OrLikeAggregationOperator op. 

class Interest { 
  Alternative[] alternatives; OrLikeAggregationOperator op; 
  float cutoffScore; int cutoffSize; 
  float score(Item i) { 
    float[] as = new float[alternatives.length];  // scores of the alternatives 
    float[] aw = new float[alternatives.length];  // weights of the alternatives 
    for (int j = 0; j < alternatives.length; j++) { 
      Alternative a = alternatives[j]; as[j] = a.score(i); aw[j] = a.weight; 
    } 
    return op(i, as, aw); } 
} 
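
The bottom-up evaluation of a complete query tree can be tried out with the following self-contained sketch. It simplifies the model in several ways that are our own assumptions: weights are ignored, the and-like operator is the plain minimum, the or-like operator is the plain maximum, an item is a map from attribute names to numeric values, and a predicate is a function from a (possibly missing) value to a score.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class QueryTreeSketch {
    /** Leaf node: a possibly negated predicate applied to one attribute. */
    static final class Criterion {
        final String attribute;
        final Function<Double, Double> predicate; // value (or null) -> score in [0,1]
        final boolean negated;
        Criterion(String attribute, Function<Double, Double> predicate, boolean negated) {
            this.attribute = attribute;
            this.predicate = predicate;
            this.negated = negated;
        }
        double score(Map<String, Double> item) {
            double s = predicate.apply(item.get(attribute));
            return negated ? 1.0 - s : s;
        }
    }

    /** Conjunction of criteria, aggregated here by the unweighted minimum. */
    static double scoreAlternative(List<Criterion> criteria, Map<String, Double> item) {
        double s = 1.0;
        for (Criterion c : criteria) s = Math.min(s, c.score(item));
        return s;
    }

    /** Disjunction of alternatives, aggregated here by the unweighted maximum. */
    static double scoreInterest(List<List<Criterion>> alternatives, Map<String, Double> item) {
        double s = 0.0;
        for (List<Criterion> alt : alternatives) s = Math.max(s, scoreAlternative(alt, item));
        return s;
    }

    /** Boolean predicate "is exactly target"; a missing value scores 0. */
    static Function<Double, Double> isExactly(double target) {
        return v -> (v != null && v == target) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        Map<String, Double> item = new HashMap<>();
        item.put("width", 10.0);
        item.put("height", 14.0);
        List<Criterion> a1 = Arrays.asList(
            new Criterion("width", isExactly(10.0), false),
            new Criterion("height", isExactly(15.0), false));
        List<Criterion> a2 = Arrays.asList(
            new Criterion("height", isExactly(15.0), true)); // height is NOT exactly 15
        System.out.println(scoreAlternative(a1, item));
        System.out.println(scoreInterest(Arrays.asList(a1, a2), item));
    }
}
```

The first alternative fails because the height criterion scores 0, while the interest as a whole succeeds through the second alternative, whose negated criterion scores 1.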

5 The PIA Software Architecture 

The PIA object model serves as a framework for different query evaluation strate- 
gies, implementations of numeric scoring functions, and persistent storage tech- 
nologies for semistructured data (see Fig. 2). In this paper, we intentionally do 
not target the implementation of these components, which are critical for both 
query efficiency and effectiveness, but instead rely on already existing solutions 
referred to in chapter 4. The object model itself defines a clear-cut interface to 
these components which enables the smooth integration of supporting technology. 




Fig. 2. Overview of the PIA Software Architecture 



In the current state of our project, the PIA system runs as a Java Swing application 
communicating via RMI with the server containing the object-oriented 
catalog model. Other implementations based on Java applet or servlet technology 
are also conceivable and could be realized with little effort. Catalog data is 
imported via an XML import filter and can also be exported into XML files. 
Although, from our point of view, XML is suitable for the representation of objects 
in a personal information assistant, we still plan to realize filters for other 
well-accepted exchange formats. 

5.1 Interaction with the System 

The prototype’s user interface is implemented using the Java Foundation 
Classes (JFC, also known as Swing). The large number of graphical components 
available in Swing eases the task of giving the interface a clear, understandable 
structure, which is essential for non-expert users of a personal information 
assistant. 
















Fig. 3. The Interactive Query Interface for PIA Customers 



Taking the viewpoint of a PIA customer first, one can detect four large icons 
on the left side of the window, allowing the user to decide between four differ- 
ent activities. He/she can formulate an interest, evaluate this interest, change 
his/her personal preferences, or subscribe for periodical notifications on special 
offers according to his/her interests. These icons remain visible throughout the 
customer’s session. Providers and maintainers are faced with similar activity 
panels representing their specific activities with the PIA system. 

During a session, the information assistant stays in close interaction with 
the customer (and all other users). It provides e.g. tool tip information for the 
different graphical components that can be used. Furthermore, a status line at 
the bottom of the window displays messages important for the current state of 
usage (e.g. number of items matching the current query). 
When formulating the query (see Fig. 3), the customer is provided with information 
on the catalog. In the table next to the activity panel, all the attributes 
currently available in the catalog are displayed together with the frequency with 
which they appear. In this way, the customer is able to estimate how effective 
the choice of a certain property in the formulation of his/her interest can be 
with respect to the current catalog contents. The current interest can be seen 
on the right of the window, where the desired criteria are shown together with 
their weights. The interest can be extended by clicking on one of the attributes 
in the attribute table on the left. As a result, an editor window opens. The customer 
can now choose one of the predicates available for the attribute’s domain 
and afterwards enter the necessary argument values. As soon as he/she has finished, 
the new property appears in the panel on the right. The status line 
is updated, now stating the number of items which would be retrieved from the 
catalog for the current interest, so that the customer knows how effective the 
query would be. The customer may now decide to refine the interest or to evaluate 
it. Evaluation is triggered by pressing the respective button in the activity 
panel. 




Fig. 4. The Interactive Evaluation Interface for PIA Customers 



The evaluation returns a list of items ranked by their relevance for the interest 
(see Fig. 4). Each item is connected with a link leading to the URL containing 
the detailed item description. In this way, the customer is offered quick access 
to generic decision support as expected from a personal information assistant. 
Using URLs here guarantees that the descriptions can also be stored and updated 
by the information provider, who, in many cases, has the deeper understanding 
of the objects’ semantics. Having inspected the list of results, the customer may 
decide to reformulate his/her interest, which he/she can do simply by pressing the 
query button on the left again. The program then returns to the query window, 
still showing the interest formulated before. Of course, both the formulation 
and the evaluation of interests are also accessible to catalog maintainers and 
information providers. 

The window for browsing and updating personal preferences can be reached 
by pressing the third button in the activity panel. Here, the customer is offered 
a tabbed pane consisting of different cards for different kinds of preferences. 
Among others, it is possible to choose preferred measuring units for domains or 
to view, refine or delete stored interests. 

The modalities for periodical notification on new offers can be set by pressing 
the last button on the activity panel. The customer has the choice between two 
different kinds of subscription. On the one hand, he/she can receive a general 
list of special offers generated from time to time for all PIA customers. On the 
other hand, he/she can subscribe to offers according to his/her own special interests 
stored within the system. 




Fig. 5. The Interactive Interface for PIA Maintainers 



The major task of PIA catalog maintainers is to keep the domains and attributes 
within the system in a state adequate to the needs of customers and 
providers. Thus, the maintainer’s activity panel includes three buttons for browsing 
and updating discrete, full-text and numeric domains, as well as one button 
for the maintenance of attributes (see Fig. 5). The windows opened by these 
buttons all share a very similar structure. The maintainer can choose 
one domain or attribute, respectively, from the table in the upper part of the window. 
Clicking on the desired line opens an editor window. This editor window 
allows the maintainer to change the characteristics of domains and of attributes. 
For example, the maintainer can change the name of a domain, rearrange its value 
space, or add or delete predicates. He/she is restricted to updates allowed by the 
system. In this way, a kind of type checking according to the PIA object model 
is realized. 
Although PIA information providers would usually not type in product information 
item by item, they are still offered an interface of their own, which can 
be especially practical when incorrect information has entered 
the catalog via some import filter. The provider then has the opportunity 
to bring the catalog back to a consistent state using this interface. 

6 Concluding Remarks 

This paper does not propose yet another fuzzy query model or fuzzy predicate 
semantics but it defines the syntax and semantics of a generic framework for 
the implementation of interactive catalogs which do not require information 
providers to adhere to a strict data model and which allow customers to combine 
in an intuitive and flexible way fuzzy predicates over numeric, discrete and full- 
text domains. 

With our initial implementation of the query engine we target small to 
medium-sized corporate catalogs (up to 10"^ items with up to 10^ attributes) 
where a semi-naive evaluation strategy suffices to yield acceptable interactive 
system performance in single-user mode. In this scenario, usability and flexibility 
of the interactive query interface are more important than raw query execution 
speed. We closely follow recent developments in database research on optimized 
access to semi-structured data [14,15], on similarity indexing [9] and on fuzzy 
query evaluation techniques [7] which may provide significant potential for query 
evaluation improvements. 

Our model and system architecture have been influenced heavily by our experience 
gained in building interactive web catalogs (e.g. electronic marketplaces 
for classified ads [16]) and we are currently transferring results of this research 
back into industrial projects with partners from the German media industry. 
Abstract. This paper is concerned with tracking and interpreting scholarly 
documents in distributed research communities. We argue that current 
approaches to document description, and current technological infrastructures, 
particularly over the World Wide Web, provide poor support for these tasks. 
We describe the design of a digital library server which will enable authors to 
submit a summary of the contributions they claim their document makes, and 
its relations to the literature. We describe a knowledge-based Web environment 
to support the emergence of such a community-constructed semantic hypertext, 
and the services it could provide to assist the interpretation of an idea or 
document in the context of its literature. The discussion considers in detail how 
the approach addresses usability issues associated with knowledge structuring 
environments. 



1 Introduction 

The mere publication of information does not constitute a body of knowledge; nor 
does simply obtaining information constitute understanding. Obtaining documents is 
just the first step; meaning and significance arise through their interpretation, which 
results in an understanding of the perspective adopted. In this paper we describe a 
representation scheme implemented in a Web server architecture to assist scholars in 
articulating, interpreting and contesting perspectives. Specifically, these perspectives 
take the form of networks of claims about ideas and documents. We propose that 
researchers enrich their texts with nodes and links which they add to a semantic 
network. 

The paper begins by considering tasks that face scholars in analysing a document 
or literature. We argue that current approaches to document description, and current 
technological infrastructures, particularly over the World Wide Web, provide poor 
support for these interpretive tasks. We describe an approach to modelling the 
perspective in which documents are embedded, based on researchers’ claims about 
their own work, and those of others. We detail how such a model is being 
implemented in a Web environment, and consider the services it could provide to a 
scholarly community. These include the creation, visualization and interpretation of 
conceptual structures reflecting the relations between different research efforts, 
scholarly perspectives and debates, which we exemplify with a worked example. We 
then discuss related work, key issues that this work raises, and the next steps in our 
research programme. 

S. Abiteboul, A.-M. Vercoustre (Eds.): ECDL ‘99, LNCS 1696, pp. 423-442, 1999 
© Springer-Verlag Berlin Heidelberg 1999 
2 The Scholarly Work of Interpretation 

Contextualising ideas in relation to the literature is a fundamental task for authors and 
readers — are they new, significant, and trustworthy? Scholars usually start by 
bringing to bear their own knowledge of the field. This often leads to commentary and 
discourse of various kinds, which reflect the extent to which peers regard an author’s 
work as authoritative, ranging from private annotation of a document, to formal peer 
review of conference/journal submissions, to published reviews of literatures and 
books. In the context of annotation and peer review, we have described elsewhere a 
publishing toolkit [25] that converts a scholarly document into a structured discussion 
website, and an electronic journal [13] which uses this to support an innovative peer 
review model. 

We can think of conventional scholarly publication and debate as a document- 
centred, text-based process. Text is a rich medium in which to publish and discuss 
ideas in detail and with subtle nuances, but the corresponding disadvantage is that it 
takes a long time to read, and is hard to analyse computationally. We have 
demonstrated in previous work that a document-centric use of the Web can transform 
peer review in important ways [9], [39]. The complementary approach described in 
this paper, with different goals, focuses on the conceptual models implicit in textual 
documents and discourse. The goal is to provide a summary representation of ideas 
and their interconnections, in order to assist literature-wide analysis. We propose that 
this has advantages over textual media for tracing the intellectual lineage of a 
document’s ideas, and for assessing the subsequent impact of those ideas, that is, how 
they have been challenged, supported and appropriated by others. In addition, the 
availability of explicit conceptual models opens possibilities for automatic assistance 
in analysing a community’s (published) collective understanding of ideas. 

We begin with the idea that an author’s goal is to persuade the reader to accept 
his/her perspective, which constitutes a set of claims about the world. Usually, the 
author has some new ideas that s/he is contributing, and asserts particular 
relationships between these and existing ideas already published in order to 
demonstrate both the reliability of the conceptual foundation on which s/he is 
building, and the innovation and significance of the new ideas. The reader’s task is to 
understand which ideas are being claimed as new, and assess their significance and 
reliability. Moving to the task of literature search and analysis, in this case, the 
scholar has some ideas and relationships in mind that s/he is trying to locate in the 
literature — has anyone written about them, or perhaps these ideas exist but not yet in a 
single document? The interpretive task includes formulating the ideas of interest in a 
variety of ways that may uncover relevant documents, reading the documents and 
then interpreting them to characterise any patterns that appear to emerge. This is a 
similar scenario to that of a newcomer to a scholarly community (e.g. a student; 
librarian; lecturer or researcher from another discipline) who wants to know, for 
instance, what the seminal papers are, or if there are distinctive perspectives on 
problems or classes of technique that define that community. 

Scholars are poorly supported in these tasks by conventional library environments, 
physical or digital. Consider the document interpretation task. In the non-digital 
world, there is currently no way beyond following citations (only those provided by 
the author), or using citation indices (to find others citing him for some reason), to ask 
questions such as: 
• Has anyone built on the ideas in this paper, and in what way? 

• Has anyone challenged this paper? 

• Has anyone proposed a similar solution but from a different theoretical 
  perspective? 

Considering the literature analysis task, there is currently no way for a scholar to 
query a digital library with analytical/interpretive questions of the following sorts: 

• Are there any documents building on theory T, but which contradict each other’s 
  predictions? 

• Are there any documents applying method M to domains D and E? 

• Is there any Web-based software which tackles problem P? 

• What impact did Theory T have? 

• Are there distinctive theoretical perspectives on problem P? 

These are the kinds of phenomena of most interest to scholars when they write papers, 
engage in debate or search the literature. These are also the kinds of questions asked 
by researchers unfamiliar with a literature, including students. Our approach seeks to 
provide better support for identifying significant conceptual structures that a research 
community considers important. 



3 Evolving the Web Beyond Simple Linking 

The World Wide Web is the first global hypertext system to emerge, providing a 
rudimentary infrastructure for publishing interlinked documents and discourses. This 
level of representational sophistication is already extremely useful for enhancing, 
even transforming, scholarly publication and discourse, as we have documented 
elsewhere [9], [39]. However, the Web provides little support for structuring, 
searching or analysing scholarly concepts, documents or discourses. Early, pre-Web 
hypertext systems have already demonstrated (on a small scale) the power of features 
such as semantically typed nodes and links, bidirectional links, composite nodes 
which represent more complex structures, and structural searching. It is increasingly 
recognised that the Web would benefit from such features (see for instance an 
analysis of ‘fourth generation’ Web functionality [4]). 

Information retrieval research using statistical and text analysis has produced 
techniques for clustering and mapping documents based on semantic similarity [11], 
[12], and for inferring certain types of inter-document relationships (e.g. "cites", or 
"summarises") [1]. Automatic techniques clearly have the advantage that they can be 
applied to large text corpora with little human effort once the texts are in a suitable 
format for processing. From the perspective of scholarly interpretation, the key 
weakness of such techniques is that it is extremely hard, if not impossible, to 
automatically identify more complex kinds of scholarly relationships between 
documents such as those given above. Human-encoded document descriptions are 
required to express such scholarly claims and structures. 

We propose that when a new article is ready for dissemination, authors describe 
the document’s main contributions and relationships to the literature using a 
controlled vocabulary analogous to a metadata scheme (but implemented using a 
formal ontology), and submit the description to a networked repository. We describe 
in the next section the concept of an ontology, and the representational scheme 
underpinning our approach. 



4 Representing Scholarly Claims 
4.1 Ontologies 

An ontology in philosophy refers to a model of what exists in the world [10]. The 
artificial intelligence community has appropriated the term to mean the construction 
of knowledge models [18], [32] which specify concepts or objects, their attributes, and 
inter-relationships. A knowledge model is a specification of a domain, or problem 
solving behavior, which abstracts from implementation-centered considerations and 
focuses instead on the concepts, relations and reasoning steps characterizing the 
phenomenon under investigation. Our application of knowledge modelling in this 
project is to implement a semantic network which expresses important aspects of the 
web of ideas and perspectives implicit in the documents and minds of a scholarly 
community. 



4.2 An Ontology for Representing Scholarly Claims 

An ontology reflects a (typically community-wide) viewpoint on how best to 
conceptualise a particular domain or phenomenon. Hence, its main role is to support 
knowledge sharing and reuse. It might appear paradoxical, therefore, to propose the 
use of ontologies to support scholarly communities in managing their knowledge, 
since conflicting worldviews, evidence and frames of reference lie at the heart of 
research and debate. 

The key issue is in what is represented. Our approach builds on a relatively stable 
dimension of what are otherwise constantly evolving research fields, by representing 
scholars’ claims about the significance of ideas and concepts — a focus on discourse 
and argumentation (how scholars support and contest claims), and on context (the 
conceptual network in which an idea is embedded). In other words, it is hard to 
envisage when researchers will no longer need to make claims about, or contest, the 
nature of a document’s contributions (e.g. “this is a new theory, model, notation, 
software, evidence”), or its relationships to other documents (e.g. “it applies, 
modifies, predicts, refutes...”). Moreover, separating the representation of concepts 
from claims about them will be critical to supporting multiple perspectives. 

We are adopting a philosophy of ‘minimal ontological commitment’ [19] and 
incremental formalisation [37], which reflects an emphasis on making explicit just 
enough structure to be usefully expressive and enable the provision of valuable 
computational services, but leaving the document texts to express the details and 
nuances of an author’s argument (as opposed to trying to formalise it). This minimises 
the effort required to submit a document description; if there is evidence that authors 
wish to link ontological concepts to specific paragraphs within a document (e.g. as 
proposed by [23]), then we can provide a way to do so, but we will begin at the 
document level. The kind of core scheme we are moving towards, suitable for a wide 
range of disciplines, is proposed in Figure 1, but it is both generalisable and tailorable 
to other fields (e.g. a more experimental field might specialise Idea into Hypothesis; 
another might not need Software). 

Fig. 1. Representational elements for summarising the key contributions of a publication, 
and their relationships to other concepts. 

We hypothesise that a research community should be able to agree on a relatively 
small set of uncontroversial conceptual and relational types which can adequately 
express the majority of claims made. The goal is to design an ontology which is 
simple enough to understand without being simplistic, yet expressive enough that 
most researchers can represent the key claims made in most documents. 
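
As a rough illustration of how small such a core scheme is, and of what tailoring it to a field might involve, the following Python sketch lists the concept and relation types named in this paper and derives two field-specific variants. It is not part of the ScholOnto implementation: the function `tailor` and its parameters are assumptions made for the example.

```python
# Core vocabulary, using only type names that appear in the text and figures.
CORE_CONCEPTS = frozenset({
    "Problem", "Software", "Methodology", "Language",
    "Theory/Model", "Phenomenon", "Idea",
})
CORE_RELATIONS = frozenset({
    "analyses", "addresses", "uses-applies", "modifies-extends",
    "predicts", "supports", "raises-issues-with", "refutes",
})


def tailor(concepts, specialise=None, drop=()):
    """Return (concept_types, parents) for a field-specific variant.

    `specialise` maps each new type to the core type it refines;
    `drop` lists core types that the field does not need.
    """
    parents = dict(specialise or {})
    for child, parent in parents.items():
        if parent not in concepts:
            raise ValueError(f"cannot specialise unknown type {parent!r}")
    return frozenset((set(concepts) | set(parents)) - set(drop)), parents


# Hypothetical tailorings, following the two examples given in the text.
experimental, parents = tailor(CORE_CONCEPTS, specialise={"Hypothesis": "Idea"})
theoretical, _ = tailor(CORE_CONCEPTS, drop={"Software"})
```

The relation types are left untouched in both variants, which reflects the hypothesis that the argumentation vocabulary is the stable part of the scheme.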

More elaborate argumentation schemes have been proposed for computer- 
supported argumentation (e.g. [26], [34], [38], [40]), but our analysis of this literature 
shows little evidence of successful uptake (see discussion). The ScholOnto scheme 
therefore supports argumentation in relatively simple terms (supports, raises issues 
with, and refutes) to make it as easy as possible to add an argumentation link to a 
concept or document; more elaborate schemes can be introduced when there is the 
demand from a community. Moreover, whilst rigorous and carefully maintained 
argumentation networks make many kinds of useful analysis possible (the main 
motivation for increasing a schema’s expressiveness), such tools make assumptions 
about users’ expertise and consistency of representation that are unlikely to hold in 
the context of an open, internet community. The details of an author’s reasoning are 
therefore left to the document’s text, and are not made explicit in the knowledge base. 

To summarise, we propose that this represents a novel approach to the persistent 
problem facing any ontology development effort, namely, that the world being 
described is typically dynamic, necessitating resource intensive updating and 
restructuring. Shifting the representational focus to the way in which researchers 
make new contributions to the literature avoids the problem of committing to a 
domain ontology that quickly becomes outdated. The domain ontology is only 
constructed in the context of authors’ claims about their work, and can be contested 
by others. 



5 Implementation 



5.1 Knowledge Modelling Infrastructure 

Our approach relies on a suite of robust knowledge modelling technologies developed 
and tested in other domains. The OCML modelling language [30] supports the 
construction of knowledge models by means of several types of construct. It allows 
the specification and operationalization of functions, relations, classes, instances and 
rules. It also includes mechanisms for defining ontologies and problem solving 
methods [3], the main technologies developed in the knowledge modelling area. 
Problem solving methods are specifications of reusable problem solving behaviours. 
OCML has been used in several projects, in domains such as medicine, geology, 
engineering design and organizational learning. As a result the language is now 
associated with a large library of reusable models, providing a useful resource for the 
knowledge modelling community. In our scenario, OCML provides the formalism for 
defining our ontology for scholarly debate and interpretation, henceforth referred to as 
ScholOnto. 

WebOnto [14] enables knowledge engineers to collaboratively browse and edit 
knowledge models over the Web. The architecture is composed of a central server and 
a Java applet. WebOnto’s central server is built on top of a customised web server, 
LispWeb [33], and uses OCML as the underlying modelling language. In addition to 
implementing the standard HTTP protocol, the LispWeb server offers a library of 
high-level Lisp functions to dynamically generate HTML pages, a facility for 
dynamically creating image maps, and a server-to-server communication method. The 
WebOnto Java applet provides multiple visualizations of OCML knowledge models, a 
direct manipulation and forms interface for creating new knowledge structures, and a 
groupware facility which supports both synchronous and asynchronous model 
building by teams of knowledge engineers (illustrated in Figures 2 and 9). Applied to 
the problem of managing scholarly knowledge concepts and documents in online 
research communities, these technologies provide the building blocks for a scalable 
Web infrastructure. 

Further details on these tools and our approach to enriching documents with 
ontologies can be found in [31]. 



5.2 Ontology Design 

Figure 2 shows the top level structure of the ontology, as specified in OCML. Both 
nodes and links in the semantic network created by scholars’ submissions are 
SCHOLARLY-KNOWLEDGE-CONCEPTs. Nodes are SCHOLARLY-CONTRIBUTION-ELEMENTs, 
and links are SCHOLARLY-RELATIONSHIPs, which are subdivided into 
ARGUMENTATION-LINKs and NON-ARGUMENTATION-LINKs. 

Fig. 2. Class structure of the ScholOnto ontology, represented in the WebOnto system. 

Figure 3 shows the class definitions for the scholarly concepts of SOFTWARE, 
METHODOLOGY and LANGUAGE. SOFTWARE is defined as a 
SCHOLARLY-CONTRIBUTION-ELEMENT, which ADDRESSES PROBLEMS, USES/APPLIES 
any other kind of SCHOLARLY-CONTRIBUTION-ELEMENT (e.g. a METHOD or 
LANGUAGE), and MODIFIES/EXTENDS other kinds of SOFTWARE. 
METHODOLOGY and LANGUAGE are similarly defined. 

(def-class SOFTWARE (scholarly-contribution-element) 
  ((addresses :type problem) 
   (uses-applies :type scholarly-contribution-element) 
   (modifies-extends :type software))) 

(def-class METHODOLOGY (scholarly-contribution-element) 
  ((addresses :type problem) 
   (modifies-extends :type methodology) 
   (uses-applies :type scholarly-contribution-element))) 

(def-class LANGUAGE (scholarly-contribution-element) 
  ((addresses :type problem) 
   (uses-applies :type language :type theory-model) 
   (modifies-extends :type language))) 

Fig. 3. OCML definitions of SOFTWARE, METHODOLOGY and LANGUAGE. 
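
For readers unfamiliar with OCML, the effect of these typed slots can be approximated in Python. The sketch below is an analogy, not the OCML semantics: the `Concept` root class, the `relate` method and the example instances are assumptions, and reading the two `:type` entries on LANGUAGE's uses-applies slot as alternatives is also an assumption.

```python
class Concept:
    """Root of the sketch's class hierarchy (hypothetical)."""

    def __init__(self, name):
        self.name = name


class Problem(Concept):
    pass


class ScholarlyContributionElement(Concept):
    # Maps a relation name to the classes allowed as its target.
    slots = {}

    def __init__(self, name):
        super().__init__(name)
        self.links = []  # (relation, target) pairs asserted so far

    def relate(self, relation, target):
        """Add a link after checking it against the slot's type constraint."""
        allowed = self.slots.get(relation)
        if allowed is None:
            raise ValueError(f"{type(self).__name__} has no slot {relation!r}")
        if not isinstance(target, allowed):
            raise TypeError(f"{relation} does not accept {type(target).__name__}")
        self.links.append((relation, target))
        return self


class TheoryModel(ScholarlyContributionElement):
    pass


class Software(ScholarlyContributionElement):
    pass


class Language(ScholarlyContributionElement):
    pass


# Slot constraints mirroring Figure 3 (assigned after all classes exist).
Software.slots = {
    "addresses": (Problem,),
    "uses-applies": (ScholarlyContributionElement,),
    "modifies-extends": (Software,),
}
Language.slots = {
    "addresses": (Problem,),
    "uses-applies": (Language, TheoryModel),
    "modifies-extends": (Language,),
}

# Hypothetical instances.
z = Language("Z")
standards = Problem("absence-of-standards")
system = Software("some-hypertext-system")
system.relate("addresses", standards).relate("uses-applies", z)
```

A call such as `system.relate("modifies-extends", z)` is rejected with a `TypeError`, because SOFTWARE may only modify or extend other SOFTWARE.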

The ontology is designed to support scholars in making claims by asserting 
relationships between concepts. Other scholars may support, raise-issues-with, 
or refute these claims. Figure 4 shows schematically the structure of a 
scholarly “claim” in the ontology. The OCML specification associated with the 
structure in Figure 4 is shown in Figure 5. 

Fig. 4. The structure of a scholarly “claim” in the ontology. 

A claim is formally defined as a relation between a set of authors, the 
legal-scholarly-assertion that they make, and a justification. A 
legal-scholarly-assertion is a statement instantiating a scholarly-relationship 
(e.g. addresses, predicts, refutes) between two elements (e.g. 
methodology X addresses problem Y). The justification is free text 
supporting a claim. Authors are not expected to enter one when describing their own 
document, since the justification for their claim is already to be found in that document. But if 
another scholar supports, raises-issues-with, or refutes 
another’s claim (see scheme in Figure 1) without publishing an associated document, 
then some form of textual justification is expected. This could in turn point to a more 
rigorous justification in another document (ideally, directly accessible). 

(def-relation claims (?X ?Y ?Z) 
  :constraint (and (set ?X) (every ?X author) 
                   (legal-scholarly-assertion ?Y) 
                   (justification ?Z))) 

(def-class legal-scholarly-assertion (assertion) ?x 
  :iff-def (and (assertion ?x) 
                (== ?x (?a ?b ?c)) 
                (scholarly-relationship ?a))) 

(def-class justification (string)) 

Fig. 5. Separating claims from scholarly relationships; there can be many different 
(possibly contradictory) claims by different authors. 
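
The separation of claims from the assertions they are about can also be sketched outside OCML. In the Python below, which is an illustration under stated assumptions and not the ScholOnto code, an `Assertion` is a relationship between two elements and a `Claim` attaches a set of authors and an optional justification to it. The class names, field names and example identifiers are hypothetical.

```python
from dataclasses import dataclass

# Relationship names taken from the text and figures of this paper.
SCHOLARLY_RELATIONSHIPS = frozenset({
    "analyses", "addresses", "uses-applies", "modifies-extends",
    "predicts", "supports", "raises-issues-with", "refutes",
})


@dataclass(frozen=True)
class Assertion:
    relationship: str
    source: str
    target: str


@dataclass(frozen=True)
class Claim:
    authors: frozenset
    assertion: Assertion
    justification: str = ""

    def __post_init__(self):
        # Counterpart of the constraint on the claims relation in Figure 5.
        if not isinstance(self.authors, frozenset):
            raise TypeError("authors must be a frozenset of author names")
        if self.assertion.relationship not in SCHOLARLY_RELATIONSHIPS:
            raise ValueError(
                f"not a scholarly relationship: {self.assertion.relationship!r}")


# Two contradictory claims by different (hypothetical) authors coexist,
# because the knowledge base stores claims rather than facts.
knowledge_base = [
    Claim(frozenset({"author-a"}),
          Assertion("supports", "evidence-e", "theory-t")),
    Claim(frozenset({"author-b", "author-c"}),
          Assertion("refutes", "evidence-e", "theory-t"),
          "free-text justification"),
]
```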

To both support and refute a particular claim is usually a sign of 
inconsistency or, perhaps, of a position that has changed since an earlier paper. 
OCML’s environment makes it easy to construct rules that could, for instance, check 
for instances where an author is a member of two sets of authors who have made 
conflicting assertions (Figure 6). 

(def-relation inconsistent-position (?auth ?assertion) 
  :constraint (and (author ?auth) 
                   (legal-scholarly-assertion ?assertion)) 
  :sufficient (and (member ?auth ?x) (supports ?x ?assertion ?z) 
                   (member ?auth ?x2) (refutes ?x2 ?assertion ?z2))) 

Fig. 6. An OCML rule for an agent to check for positions that both support and 
refute a particular claim. This might reflect an inconsistent-position or at least, 
claims meriting closer examination. 

Encapsulating such rules in agents that researchers could select from a library and 
tailor is a scenario that the ScholOnto architecture aims to support. 
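
The check performed by such an agent can be illustrated procedurally. The Python sketch below assumes that stances are available as (authors, stance, claim identifier) records, which is a simplification chosen for the example. The function name and the data are hypothetical.

```python
def inconsistent_positions(stances):
    """Find authors who both support and refute the same claim.

    `stances` is an iterable of (authors, stance, claim_id) records.
    Returns a set of (author, claim_id) pairs that merit closer examination.
    """
    supporters, refuters = {}, {}
    for authors, stance, claim_id in stances:
        if stance == "supports":
            supporters.setdefault(claim_id, set()).update(authors)
        elif stance == "refutes":
            refuters.setdefault(claim_id, set()).update(authors)
        # Other stances (e.g. raises-issues-with) do not trigger the rule.
    return {
        (author, claim_id)
        for claim_id in supporters.keys() & refuters.keys()
        for author in supporters[claim_id] & refuters[claim_id]
    }


# Hypothetical data: "jones" co-authors one paper supporting claim c1
# and another refuting it.
stances = [
    ({"smith", "jones"}, "supports", "c1"),
    ({"jones", "lee"}, "refutes", "c1"),
    ({"lee"}, "supports", "c2"),
    ({"smith"}, "raises-issues-with", "c1"),
]
flagged = inconsistent_positions(stances)
```

As in the OCML rule, membership of the two author sets is what matters: the two stances need not come from the same document.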



6 A Worked Example 

We now return to our opening examples of literature and document interpretation, and 
use a worked example to clarify how our tools could support scholarly work. Within 
the hypertext research literature, one of the landmark articles is the summary of the 
Dexter Hypertext Reference Model by Halasz and Schwartz [22], which specifies both 
semiformally, and formally (using the Z notation), abstract properties of hypertext 
systems, enabling comparison of existing systems, and specification of theoretically 
possible future systems. 

WebOnto already generates forms with contextual menus for adding new 
instances to an ontology (see [31]), but needs to be extended to generate forms that 
non-knowledge engineers can use. Figure 7 shows a prototype user interface for 
submitting the article’s description to the repository. The user interface will guide 
users through the schema using dynamic menus, and enable them to browse and 
search for existing concepts to assist their reuse. Some domain concepts are simple to 
reference (e.g. the name of a specific software system, framework or methodology), 
whilst others are less concrete and could benefit from information retrieval support, 
e.g. finding the name(s) used to describe a domain problem (“user disorientation”), an 
idea (“a global hypertext system”), or an empirical phenomenon such as a piece of 
evidence (“low ability students benefit most from physics simulations”). 

[Figure 7: screenshot of the prototype submission form. Its sections are “This document concerns” (design domain, e.g. Hypertext/hypermedia, with optional ACM CCS keywords), “Citation”, “URL”, “This article describes a new” (e.g. Theory/Model: Dexter Hypertext Reference Model), and “Relationships to other articles/concepts”, where menus offer relation types (Analyses, Addresses, Uses/Applies, Modifies/Extends, Predicts, Supports, Raises Issues With, Refutes) and concept types (Software, Methodology, Language, Theory/Model, Phenomenon, Idea, Problem).] 

Fig. 7. Prototype user interface for submitting claims about a document’s contributions. 



This form would generate an OCML entry in the ontology, as shown in Figure 8. 

(def-instance dexter-htxt-ref-model-article article 
  ((describes-scholarly-contribution-element dexter-ht-ref-model) 
   (concerns-domain hypertext-hypermedia) 
   (has-author halasz-f schwartz-m) 
   (has-title "The Dexter Hypertext Reference Model") 
   (publication-details "Communications of the ACM, 37(2), 30-39") 
   (has-url "www.acm.org/pubs/articles/journals/cacm/1994-37-2/p30-halasz/") 
   (acm-ccs "I.7.2" "H2.1" "H3.2" "H5.1"))) 

(def-instance dexter-ht-ref-model theory-model 
  ((addresses absence-of-standards) 
   (analyses notecards 
             augment 
             concordia 
             hypercard 
             hyperties 
             intermedia 
             kms-zog 
             neptune-ham) 
   (envisages theoretically-possible-dexter-compliant-systems) 
   (uses-applies Z))) 

Fig. 8. The OCML entry for the Dexter article, declaring its contributions to the literature 
(dexter-ht-ref-model, which is a theory-model, and predicts theoretically 
possible systems), and its relationship to other concepts (analyses several existing systems, 
and uses-applies the Z notation). 

The article is now added to the ScholOnto knowledge base, enabling users to ask 
questions such as, “What motivated the Dexter Hypertext Reference Model, and what 
impact has it had?” A forms-based interface, generated automatically from the 
ScholOnto ontology by WebOnto, enables users to ask such questions through simple 
menu selection (Figure 9). 

Fig. 9. A forms-based interface generated automatically from the ScholOnto ontology by 
WebOnto enables users to query the model via menu selection. The screenshot shows several 
possible queries to analyse the motivation behind, and impact of, the Dexter Hypertext Model 
(we have combined them to save space; in reality one would most likely submit these as 
separate queries). The queries specify, respectively, (1) interest in the theory-model 
dexter-hypertxt-ref-model, (2) what problems does it analyse?, (3) are there 
any theory-models which modify-extend it?, and (4) is there any software which 
uses-applies it? 
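
Queries of this kind reduce to following typed links forwards and backwards from a concept. The Python sketch below shows the idea over a small list of (source, relation, target) triples. It does not describe how WebOnto or OCML answer queries: the first three triples follow Figure 8, while `extended-model-x`, `system-y`, `system-z` and the helper functions are hypothetical.

```python
# Concept types; the last three entries are invented later contributions.
TYPES = {
    "dexter-ht-ref-model": "theory-model",
    "absence-of-standards": "problem",
    "notecards": "software",
    "extended-model-x": "theory-model",
    "system-y": "software",
    "system-z": "software",
}

TRIPLES = [
    ("dexter-ht-ref-model", "addresses", "absence-of-standards"),
    ("dexter-ht-ref-model", "analyses", "notecards"),
    ("dexter-ht-ref-model", "uses-applies", "z"),
    # Hypothetical subsequent work:
    ("extended-model-x", "modifies-extends", "dexter-ht-ref-model"),
    ("system-y", "uses-applies", "extended-model-x"),
    ("system-z", "uses-applies", "dexter-ht-ref-model"),
]


def outgoing(subject, relation):
    """Targets that `subject` links to via `relation` (its motivation and roots)."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]


def incoming(target, relation, of_type=None):
    """Sources linking to `target` via `relation` (its impact), optionally by type."""
    return [
        s for s, r, o in TRIPLES
        if o == target and r == relation
        and (of_type is None or TYPES.get(s) == of_type)
    ]


model = "dexter-ht-ref-model"
problems = outgoing(model, "addresses")
extensions = incoming(model, "modifies-extends", of_type="theory-model")
applications = incoming(model, "uses-applies", of_type="software")
```

Outgoing links answer the "what motivated it" half of the question, and incoming links the "what impact has it had" half.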

The knowledge base could generate filtered visualizations of the literature (e.g. based 
on the semantic network model in Figure 10) showing the Dexter Model’s motivation 
and conceptual roots (links to the left), and the nature of the work which has since built on 
it, by the same authors or by other researchers (links to the right). This is an 
illustration of the kind of concept map that could be generated by a ScholOnto search. 

Fig. 10. A semantic network model provides the basis for generating visualizations of the 
literature. In this example, an author has described the relationships (links to the left) that 
motivate and situate an article’s key contribution (a Reference Model, marked by the central 
node) within its literature. Subsequent researchers (links to the right) have modified/extended 
the model, and implemented software systems based on their extended models (e.g. the 
“DeVise” system, lower right). 



7 Discussion 

In this section we discuss some of the issues raised and possibilities opened up by the 
design we have presented, and contextualise our approach to related research. 



7.1 Intelligent Services 

A knowledge model enables inference-based searching and alerting. It will be 
possible to ask the system questions such as “What impact did Theory T have?”, since 
“impact” can be defined, for example, in terms of the number of subsequent 
documents using or modifying it, the number of different domains in which it has 
been applied, the number of problems addressed which drew on the theory, and so 
forth. Our knowledge modelling environment makes it simple for us (as system 
maintainers) to write heuristics that could assist in finding relevant documents, e.g. “if 
Method Y extends Method X, and Method X is challenged, then Method Y may be 
challenged”. We will also assist scholars in composing their own rule-based interest 
profiles, e.g. “If 3 or more documents support Language L and challenge Language M 
(or any Languages based on them), and 3 or more documents support Language M 
and challenge Language L, then send me a concept map showing their 
interconnections”, since this may be evidence of two schools of thought. 

Another important advantage of our approach is that the existence of a formally 
represented knowledge model makes it possible to envisage additional reasoning 
services on top of the ‘basic’ search support. For instance, it will be possible to 
develop specialized agents whose goal is to identify emerging perspectives, using 
heuristic knowledge and machine learning techniques. Such an agent could 
discover a ‘European perspective’ on a particular issue by analyzing the geographic 
spread of the relevant positions. 

The complex notion of a ‘school of thought’ or theoretical perspective has 
representational form within ScholOnto. A perspective can be recognised by the 
common THEORY/MODELS to which a group of researchers appeals, the associated 
METHODS and LANGUAGES which they deploy, and the body of EVIDENCE that 
they mutually support. Conversely, the set of THEORY/MODELS, METHODS, 
LANGUAGES and EVIDENCE that they collectively RAISE-ISSUES-WITH may 
represent a different perspective. Tools to enable researchers to define such 
conceptual structures open intriguing new possibilities for literature tracking and 
analysis. 
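
One simple reading of this definition can be sketched in Python: a group's perspective is the set of concepts that every member appeals to, together with the set of concepts that any member raises issues with. The record format, the function name and the data are assumptions made for the example, and a real agent would probably need looser matching than strict intersection.

```python
def perspective(authors, stances):
    """Summarise a group of authors as (common_ground, contested).

    `stances` is a list of (author, relation, concept) records.
    common_ground: concepts every author in the group uses or supports.
    contested: concepts that any author in the group raises issues with.
    """
    appeals = {author: set() for author in authors}
    contested = set()
    for author, relation, concept in stances:
        if author not in appeals:
            continue
        if relation in ("uses-applies", "supports"):
            appeals[author].add(concept)
        elif relation == "raises-issues-with":
            contested.add(concept)
    common = set.intersection(*appeals.values()) if appeals else set()
    return common, contested


# Hypothetical stances by three authors.
stances = [
    ("ana", "uses-applies", "theory-a"),
    ("ana", "supports", "evidence-1"),
    ("ana", "raises-issues-with", "theory-b"),
    ("ben", "uses-applies", "theory-a"),
    ("ben", "supports", "evidence-1"),
    ("ben", "uses-applies", "method-m"),
    ("cho", "uses-applies", "theory-b"),
    ("cho", "raises-issues-with", "theory-a"),
]
school_1 = perspective({"ana", "ben"}, stances)
school_2 = perspective({"cho"}, stances)
# Two perspectives look opposed when one contests what the other appeals to.
opposed = bool(school_1[0] & school_2[1]) and bool(school_2[0] & school_1[1])
```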

The ‘impact’ or ‘authority’ of a piece of work can also be represented in a variety 
of ways in ScholOnto, depending on author preference. Beyond quantitative counts of 
how many documents cite a document or reuse a concept (directly, or through more 
complex inheritance), one could also declare an interest in a particular methodology 
or theoretical perspective (see above) as carrying more weight, and filter the literature 
on this basis. 
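
Two of the services discussed in this section, a count-based notion of impact and the “may be challenged” heuristic, can be sketched as follows. The Python is illustrative only: the link format, the function names and the data are hypothetical, and applying the heuristic transitively along chains of extensions is an assumption that the text does not state.

```python
from collections import deque


def impact(concept, links, domain_of):
    """Count later contributions that use or modify `concept`, and their domains.

    `links` holds (source, relation, target) triples; `domain_of` maps a
    source to its domain where known.
    """
    users = {
        s for s, r, t in links
        if t == concept and r in ("uses-applies", "modifies-extends")
    }
    domains = {domain_of[s] for s in users if s in domain_of}
    return {"contributions": len(users), "domains": len(domains)}


def may_be_challenged(challenged, links):
    """Apply: if Y extends X and X is challenged, then Y may be challenged."""
    extenders = {}
    for source, relation, target in links:
        if relation == "modifies-extends":
            extenders.setdefault(target, set()).add(source)
    flagged, queue = set(), deque(challenged)
    while queue:
        current = queue.popleft()
        for extender in extenders.get(current, ()):
            if extender not in flagged and extender not in challenged:
                flagged.add(extender)
                queue.append(extender)
    return flagged


# Hypothetical links between contributions.
links = [
    ("m2", "modifies-extends", "m1"),
    ("m3", "modifies-extends", "m2"),
    ("s1", "uses-applies", "m1"),
    ("s2", "uses-applies", "m9"),
]
domain_of = {"m2": "hypertext", "s1": "education"}
summary = impact("m1", links, domain_of)
suspects = may_be_challenged({"m1"}, links)
```

Weighting particular methodologies or perspectives, as suggested above, would amount to replacing the plain counts in `impact` with weighted sums.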



7.2 Usability of Knowledge Structuring Tools 

We are acutely aware that many schemes for registering shared resources and 
providing structured descriptions founder on the crucial ‘capture bottleneck’ - the 
envisaged beneficiaries of the system simply do not have the motivation or time to 
invest in sharing resources to reach a critical mass of useful material. We drew 
sobering lessons on this theme from an analysis of the computer-supported 
argumentation literature [7], corroborated by evidence relating to groupware [20], 
design rationale support [8], organizational memory systems [6], [35], and indeed, for 
many CSCW systems that require users to formalize information [36]. Specifically, 
we need to address issues raised by the usage patterns of hypertext systems such as 
Textnet [41] for scholarly annotation and linking, and descendants such as NoteCards 
[21] and Aquanet [28]. These provided rich schemas of semantic node-link types, but 
the limited adoption of these features led Trigg and others to conclude, correctly in 
our view, that rich taxonomies of node and link types (e.g. about 40 in Textnet) 
overwhelm users. However, before eradicating all human-encoded semantic hypertext 
links from our systems, we argue that effective use of such systems may depend on 
finding the right mix of target domain, context of use and user community. 

Why should this proposal work where others have failed? Our target community, 
domain and use context have unique characteristics lacking in many other domains in 
which group memory, knowledge structuring and argumentation have not fitted well. 

• Research-oriented publishing. Academic and other research publishing, in contrast 
to other genres, emphasises the kind of careful argumentation and analysis of 
domain structure required, and possibly fostered, by this approach. (In addition, for 
teaching purposes, students can be required by a course to analyse the conceptual 
networks for a given literature, and perhaps construct their own concept maps as 
part of an assignment.) 

• Strong motivation to disseminate work. Making an impact through publications is a 
primary activity for academics. Having completed a new document, the author will 
want to maximise its ‘digital presence’ on the net by carefully encoding its 
contributions and connections to the existing literature. 

• Opportunity to reflect on how to represent ideas. In contrast to the synchronous 
group working context in which many argumentation systems have been tested, 
scholars will be describing their documents in an individual setting, with time to 
reflect on how best to construct a network description. 

• A simple semantic schema complementing text. We hypothesise that most 
researchers will have no trouble in understanding node and link types such as those 
in Figure 1, since they are the concepts of everyday research discourse. Nor are we requiring an 
author to make explicit a document’s thesis at a fine granularity; the structural 
representation is a summary to assist the document’s discovery, not a substitute. 

• Benefits deriving from Web scalability. Previous research with pre-Web groupware 
and group memory systems has always focused on individual or small scale 
collaborative use. In a small team who already know each others’ work, it is often 
hard to justify the overheads of information structuring in order to track documents 
and debates. This is in sharp contrast to the challenge of tracking and analysing 
developments in an international, evolving digital library. Using the Web as our 
collaboration and delivery medium has a second advantage: the size of a Web- 
based research community increases the chances of quickly building a critical mass 
of users, which will in turn improve the value of the services provided. 



7.3 Researchers are Not Librarians 

Internet-based digital libraries of the sort that concern this project will change the 
roles of librarian and scholarly researcher established by paper-based, geographically- 
based libraries. For an internet-based library to scale realistically, with potentially tens 
of submissions arriving every day, the only people who can be expected to initially 
describe new articles are the people with most motivation to maximise the visibility 
and impact of the work — the authors. However, authors are not librarians or 
knowledge engineers who traditionally have possessed the skills to do information 
classification. This raises two issues: whether scholars are able to describe their 
documents sufficiently well to enable the system to make use of their descriptions, 
and how to make the underlying description technologies accessible and 
understandable. We follow initiatives to develop metadata schemes for the Web (see 
next section) in assuming that given intuitive representation schemes and user 
interfaces, domain experts will be able to submit useful descriptions of their own 
work. These are, however, empirical questions that will be addressed through 
lab-based studies of detailed interaction with the system, followed by broader field 
trials once the system is deployed in different research communities. 



7.4 Emergent Work Practices 

If a significant proportion of a research community adopted a digital library 
infrastructure of the sort proposed, such that it became a de facto standard to register 
new documents on the server, it is likely that it would effect a shift in working 
practices. In a more speculative mood, we briefly consider what forms this shift might 
take. 

The public face of an author’s work would expand from his or her publications to 
include the corresponding concept maps that others would see. Formulating the 
contributions of an article such that this map was succinct and to the point might in 
fact improve the authoring process. A research project or lab would most likely 
evolve a ‘library’ of conceptual elements and structures which they reused in different 
documents, and which represented their ‘official’ statement of how they wished their 
work to be linked to and reused by others. 

A research field has a small number of recognisable ‘genres’ of article which an 
active member of that field can usually recognise very quickly. Indeed, these are 
formally recognised by some conferences and journals in order to guide reviewers on 
appropriate review criteria for different kinds of submission (systems paper, theory 
paper, empirical paper, etc.). We can envisage the possibility of a journal establishing 
a set of templates for submitting and reviewing submissions in different categories, 
that is, “a paper of type-X should reasonably demonstrate in its conceptual network 
representation its relationships to A, B and C, and we expect arguments of the form P, 
Q and R”. If the digital library is worth anything, it will of course assist reviewers 
in locating related work and debates that the author has omitted. 



7.5 Relationships to Other Research 

We have already addressed earlier hypertext and argumentation work; here, we situate 
our approach in relation to other relevant research and Web developments. 

Our technologies can be seen as a conceptual and technical development from 
current efforts to develop metadata description schemes for the Web. Metadata in the 
context of digital libraries refers to ways to encode information about resources in 
machine-interpretable formats, typically by completing a standard set of descriptive 
fields. Well known examples include USMARC [27] for library resources, Dublin 
Core [15] to provide a simple high level scheme for web resources, and IMS [24] for 
educational resources. Our scheme for representing multiple claims about the status of 
content extends typical metadata schemes which normally focus on encoding 
uncontroversial content attributes. From a representational and technical perspective, 
our approach differs from metadata in that ontologies support more sophisticated 
modelling, for example, specifying sufficient and necessary conditions for relations, 
and providing metalevel modelling support which makes it possible to reason about 
the ontology itself. OCML in particular also provides powerful inference support 
making it possible to directly operationalise the ontology and its instantiation as a 
knowledge base. The W3C’s proposed Resource Description Framework [42] for 
interrelating multiple metadata schemes employs semantic network modelling similar 
to the knowledge modelling concepts presented here, and could provide a future route 
to interoperability with other systems and metadata schemes. 

The (KA)² initiative (Knowledge Annotation for Knowledge Acquisition) [2] aims 
to support the knowledge acquisition community in building a knowledge base of its 
own research by populating a shared ontology. The knowledge base is constructed by 
authors annotating their web pages (e.g. publications; personal and project pages) 
with tags (analogous to HTML META tags), which can be read by a specialised 
search engine called Ontobroker [17]. The key architectural difference to our 
approach is that (KA)² semantic tags are embedded in the physical content, whereas 
our approach decouples content from claims about its status. This architectural 
difference reflects the different aims of the two enterprises. The aim of (KA)² is to 
capture the contents of web pages in a formalism which can be reasoned about by 
Ontobroker. In our scenario we do not aim to represent directly the domain-specific 
content of a paper, but the debate about the scholarly status of that content. Moreover, 
we are concerned that authors will not be prepared to invest the required effort to 
encode models of document content, and as argued earlier, we cannot assume a stable 
ontology to describe an active research field. 

Ontologies are beginning to be used in the context of digital libraries, although for 
different purposes to those set out here. Ontologies can assist the extraction of 
concepts from unstructured textual documents [16], by serving as a source of 
knowledge about the particular topic. Ontologies can also assist in managing 
document descriptions in large digital libraries [43], [44]. 

Finally, our proposal builds on the use of automated pre-print servers for the 
submission and dissemination of documents (e.g. the Los Alamos server for physics 
and computer science <xxx.lanl.gov>, or the CogPrints server for cognitive science 
<cogprints.soton.ac.uk>). Hierarchical taxonomies and keywords assist search and 
e-mail alerting. A ScholOnto knowledge-based server as proposed here extends this 
infrastructure by adding a semantically enriched layer of document encoding, with 
associated services, as discussed. One strategy for migrating from a conventional pre- 
print server to a ScholOnto server would be to import an existing pre-print server 
database, and enable authors to annotate their documents’ entries to give them a 
presence in the semantic network. 
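
The migration strategy can be made concrete with a small Python sketch. It assumes a flat record format (`id`, `title`, `authors`, `keywords`) for the imported database, which is invented for the example, as are the function names. Imported entries start with no semantic links and gain a presence in the network only once an author annotates them.

```python
def import_preprints(records):
    """Turn flat pre-print records into entries that have no semantic links yet."""
    return {
        record["id"]: {
            "title": record["title"],
            "authors": list(record["authors"]),
            "keywords": list(record.get("keywords", [])),
            "links": [],  # filled in later by the authors
        }
        for record in records
    }


def annotate(kb, doc_id, author, relation, target):
    """Let one of a document's authors add a typed link to its entry."""
    entry = kb[doc_id]
    if author not in entry["authors"]:
        raise PermissionError("only a document's authors may annotate its entry")
    entry["links"].append((relation, target))


def in_network(kb):
    """Identifiers of documents that already appear in the semantic network."""
    return [doc_id for doc_id, entry in kb.items() if entry["links"]]


# Hypothetical imported records.
kb = import_preprints([
    {"id": "p1", "title": "Paper one", "authors": ["ana"], "keywords": ["hypertext"]},
    {"id": "p2", "title": "Paper two", "authors": ["ben"]},
])
annotate(kb, "p1", "ana", "addresses", "absence-of-standards")
```

Existing keywords are carried over unchanged, so conventional search and alerting keep working for entries that have not yet been annotated.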



8 Summary and Future Work 

The Web has established itself as a medium for research document dissemination. 
However, its support for many of the interpretive tasks that scholars perform is weak, 
despite the fact that semantic hypertext systems (as the Web was originally envisaged 
by Berners-Lee) are well suited to tasks such as structural searching, pattern analysis, 
and heuristic filtering. 

Our aim is to support scholarly analysis and discourse through the creation of 
author-centred and community-wide perspectives by researchers or software agents. 
We are developing a representational and technical infrastructure to complement 
libraries of archived documents with a ‘living’ semantic network of concepts. 
Adopted by a research community, this network could reflect the evolving 
understanding and recontextualisation of ideas over time (lost with archived 
documents), making possible new forms of literature analysis. 

The proposed ontology expresses scholarly claims about domain concepts, not the 
domain concepts themselves; existing metadata and knowledge modelling efforts 
focus on describing the content of resources, rather than discourse about content. We 
focus on the conceptual models implicit in textual documents and discourse in order 
to provide a summary representation of ideas and their interconnections. This has 
advantages over textual media for tracing the intellectual lineage of a document’s 
ideas, and for assessing the subsequent impact of those ideas. In addition, the 
availability of explicit conceptual models opens possibilities for automatic analysis of 
a community’s collective knowledge. 

We are currently finalising the core ontology to assist interoperability across a 
wide range of research domains, and seeding some test literatures to evaluate 
WebOnto’s usability and automated services. We will then seek ‘early adopter’ 
communities interested in testing the ScholOnto server as a means of managing their 
own research knowledge. 



Abstract. I show that the World Wide Web is a small world, in the 
sense that sites are highly clustered yet the path length between them is 
small. I also demonstrate the advantages of a search engine which makes 
use of the fact that pages corresponding to a particular search query can 
form small world networks. In a further application, the search engine 
uses the small-worldness of its search results to measure the connected- 
ness between communities on the Web. 



1 Introduction 

Graphs found in many biological and manmade systems are “small world” net- 
works, which are highly clustered, but the minimum distance between any two 
randomly chosen nodes in the graph is short. By comparison, random graphs are 
not clustered and have short distances, while regular lattices tend to be clustered 
and have long distances. Watts and Strogatz have demonstrated that a regular 
lattice can be transformed into a small world network by making a small fraction 
of the connections random [1]. 

Transitioning from a regular lattice to a small world topology can strongly 
affect the properties of the graph. For example, a small fraction of random links 
added to a regular lattice allows disease to spread much more rapidly across the 
graph. An iterated multiplayer prisoner’s dilemma game is less likely to lead to 
cooperation if the connections between the players form a small world network 
rather than a regular lattice [1]. Costs for search problems such as graph coloring 
have heavier tails for small world graphs as opposed to random graphs, calling 
for different solving strategies [2]. 

So far, several man-made and naturally occurring networks have been identi- 
fied as small world graphs. The power grid of the western US, the collaboration 
graph of film actors, and the neural network of the worm Caenorhabditis elegans, 
the only completely mapped neural network, have all been shown to have small 
world topologies. In the case of the graph of film actors, the distance between 
any two actors is found as follows: if the two have acted together, their minimum 
distance is one. If they have not costarred together, but have both costarred with 
the same actor, their distance is two, etc. 

The concept of small worlds first arose in the context of social networks 
among people [3]. It has been estimated that no more than 10 or 12 links are 
required to go from any person to any other person on the planet via the re- 
lationship “knows,” where “knows” could be defined as “can recognize and be 
recognized by face and name.” The fact that relationships between individuals 
tend to form small world networks has been captured in several popular games. 
For example, in the game ’Six Degrees of Kevin Bacon’, one attempts to find the 
shortest path from any actor to Kevin Bacon. Because the graph of film actors is 
a small world, it is difficult to find any actor with a degree of separation greater 
than 4 from actor Kevin Bacon. There is also the Erdős number for scientists. If 
a scientist has published an article with the famous Hungarian mathematician 
Erdős, their number is 1; if they’ve published with someone who’s published 
with Erdős, their number is 2. 

In this paper I show that another man-made network, the World Wide Web, 
has a small world topology as well. Web sites tend to be clustered, but at the 
same time only a few links separate any one site from any other. This topology 
has implications for the way users surf the Web and the ease with which they 
gather information. The link structure additionally provides information about 
the underlying relationship between people, their interests, and communities. 



2 Finding Small World Properties in the Web 

Watts and Strogatz define the following properties of a small world graph: 

1. The clustering coefficient C is much larger than that of a random graph with 
the same number of vertices and average number of edges per vertex. 

2. The characteristic path length L is almost as small as L for the corresponding 
random graph. 

C is defined as follows: If a vertex v has k_v neighbors, then at most k_v(k_v - 1) 
directed edges can exist between them. Let C_v denote the fraction of these 
allowable edges that actually exist. Then C is the average of C_v over all v. 
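
The definition of C can be made concrete with a short sketch. The function name and the edge-list input are illustrative, and two conventions are assumptions not stated in the text: the neighbors of v are taken to be all sites linked to or from v, and a vertex with fewer than two neighbors contributes C_v = 0.

```python
def clustering_coefficient(edges):
    """Average C_v over all vertices of a directed graph given as (src, dst) pairs."""
    succ = {}   # vertex -> set of vertices it links to
    nbrs = {}   # vertex -> set of neighbors, ignoring link direction
    for a, b in edges:
        if a == b:
            continue  # self-links say nothing about clustering
        succ.setdefault(a, set()).add(b)
        succ.setdefault(b, set())
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)

    total = 0.0
    for v, neigh in nbrs.items():
        k = len(neigh)
        if k < 2:
            continue  # assumption: C_v = 0 when v has fewer than two neighbors
        # Count directed edges whose two endpoints are both neighbors of v.
        present = sum(len(succ[u] & neigh) for u in neigh)
        total += present / (k * (k - 1))
    return total / len(nbrs) if nbrs else 0.0
```

For a directed 3-cycle each vertex has two neighbors joined by one of the two possible directed edges, so every C_v, and hence C, equals 0.5.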

The first graph considered was the Web at the site level. Site A has a directed 
edge to site B, if any of the pages within A point to any page within site B. The 
data set used was extracted by Jim Pitkow at Xerox PARC from an Alexa crawl 
made approximately 1 year ago. It contains 50 million pages and 259,794 sites. 
Initially all links were considered to be undirected. From the 259,794 sites in the 
data set, the leaf nodes were removed, leaving 153,127 sites. An estimate of L was 
formed by averaging the paths in breadth first search trees over approximately 
60,000 root nodes. 84.5% of the paths were realizable, the rest were labeled 
with -1. The resulting histogram is shown in Fig. 1. 
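
The sampling procedure for L can be sketched as follows. This is a minimal illustration rather than the code used for the measurements: links are treated as undirected, a fixed number of root nodes is drawn at random, and unreachable pairs are recorded under the label -1. The function names, the seed parameter, and the returned tuple are assumptions.

```python
from collections import deque
import random

def bfs_distances(adj, root):
    """Shortest hop counts from root to every node reachable from it."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def estimate_path_length(edges, n_roots, seed=0):
    """Return (L, fraction of realizable paths, histogram of path lengths)."""
    adj = {}
    for a, b in edges:                      # undirected interpretation
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    nodes = sorted(adj)
    roots = random.Random(seed).sample(nodes, min(n_roots, len(nodes)))

    histogram = {}                          # path length -> count, -1 = no path
    for r in roots:
        dist = bfs_distances(adj, r)
        for v in nodes:
            if v != r:
                d = dist.get(v, -1)
                histogram[d] = histogram.get(d, 0) + 1

    reachable = sum(c for d, c in histogram.items() if d > 0)
    total_len = sum(d * c for d, c in histogram.items() if d > 0)
    pairs = sum(histogram.values())
    L = total_len / reachable if reachable else float("nan")
    return L, (reachable / pairs if pairs else 0.0), histogram
```

L is averaged over realizable paths only, which matches reporting it as the mean number of hops between connected sites.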

L was small, a mere 3.1 hops on average between any two connected sites. 
C was 0.1078, compared to 2.3e-4 for a random graph with the same number of 
nodes and edges. 

Fig. 1. Frequency of minimum path lengths between sites connected via undirected 
links 

Next, directed links were considered. This was a more natural interpretation 
of navigation between sites, since a user cannot move in the reverse direction 
on links using a standard browser. The largest strongly connected component 
(SCC), i.e. the largest subset of nodes such that any node within it can be 
reached from any other node following directed links, contained 64,826 sites. In 
order to sample the distribution of distances between nodes, breadth first search 
trees were formed from a fraction of the nodes. The corresponding histogram is 
shown in Fig. 2. 
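
One standard way to extract strongly connected components is Kosaraju's two-pass algorithm. The text does not say which algorithm was used, so the sketch below is an assumption; it is written iteratively so that large graphs do not exhaust the recursion limit.

```python
def strongly_connected_components(edges):
    """Return the SCCs of a directed graph given as (src, dst) pairs."""
    succ, pred = {}, {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)
        succ.setdefault(b, [])
        pred.setdefault(b, []).append(a)
        pred.setdefault(a, [])

    # Pass 1: record vertices in order of DFS completion on the forward graph.
    order, seen = [], set()
    for start in succ:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(succ[start]))]
        while stack:
            v, it = stack[-1]
            for w in it:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(succ[w])))
                    break
            else:
                order.append(v)   # all successors explored
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing completion order.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        comp, stack = [], [start]
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in pred[v]:
                if w not in assigned:
                    assigned.add(w)
                    stack.append(w)
        components.append(comp)
    return components
```

The largest SCC is then `max(strongly_connected_components(edges), key=len)`.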

L was slightly higher at 4.228 because the number of choices in paths is 
reduced when edges are no longer bi-directional. C was 0.081 compared to 
1.05e-3 for a random graph with the same number of nodes and edges. In short, 
even though sites are highly clustered locally, one can still hop among 65,000 
sites following on average only 4.2 between-site links (note that there might be 
additional hops within sites that are not counted in this framework). There is 
indeed a small world network of sites. 

In order to have a more accurate comparison between the small world net- 
works for sites, and the corresponding random graphs, the subset of .edu sites was 
considered. Because the .edu subset is significantly smaller, distances between 
every node could be computed. 3,456 of the 11,000 .edu sites formed the largest 
SCC. C and L were computed for a generated random graph with the same 
number of nodes and directed edges. A comparison between the distributions of 
path lengths is shown in Fig. 3. 
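
A comparison graph of this kind can be generated by drawing distinct directed edges uniformly at random. The exact construction is not specified in the text; the sketch below assumes no self-links and no repeated edges, and the function name and seed parameter are illustrative.

```python
import random

def random_directed_graph(n, m, seed=None):
    """Return m distinct directed edges over nodes 0..n-1, chosen uniformly."""
    if m > n * (n - 1):
        raise ValueError("more edges requested than ordered node pairs exist")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < m:
        a = rng.randrange(n)
        b = rng.randrange(n)
        if a != b:                 # assumption: self-links are excluded
            edges.add((a, b))
    return sorted(edges)
```

C and L can then be computed for the generated edge list in exactly the same way as for the site graph, so that any difference reflects structure rather than size.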

Fig. 3. Frequency of minimum path lengths between .edu sites compared to a random 
graph with the same number of nodes and edges 

L for the .edu graph was 4.062, similar to that of sites overall. This was 
remarkably close to L of the random graph: 4.048. At the same time C was 
much higher: 0.156 vs. 0.0012 for the random graph. The semi log plot in Fig. 4 
shows the difference in the tails of the two shortest paths histograms. While L is 
almost the same for both graphs, long paths (of up to 13) occur for the .edu site 
graph. For the corresponding random graph the maximum path is 8 and long 
paths are unlikely. While the average shortest path was almost identical, the 
small world network distinguishes itself by having a few unusually long paths 
as well as a much larger clustering coefficient. 




Fig. 4. Semilog plot of the frequency of minimum path lengths within the .edu site 
graph compared to a random graph with the same number of nodes and edges. 



In summary, the largest SCCs of both sites in general and the subset of .edu 
sites are small world networks with small average minimum distances. 



3 A Smarter Search Engine Using Small World 
Properties 

3.1 Introduction 

The above analysis showed that at the site level the Web exhibits structure while 
staying interconnected. One would expect a similar behavior at the page level. 
Related documents tend to link to one another while containing shortcut links 
to documents with different content. This small world link structure of pages 
can be used to return more relevant documents to a query. 

Links are interpreted as citations of one document by another. Citations 
have been used to evaluate the impact of journals and authors [9] [10]. Here they 
are used to identify quality web pages. Starting from the assumption that a 
good Web document references other good documents of similar content, one 
would expect that there exist groups of pages of similar content which refer to 
one another. The quality of these pages is guaranteed by the recommendations 
implicit in the links among them. Such groups can be extracted from results 
matching a particular query. Within each group there are documents which are 
good “centers”, that is, the distance from them to any other document within 
the group is on average a minimum. Centers tend to be index pages, and hence 
constitute good starting points for exploration of related query results. 

An application of these ideas was built around webbase, a repository of Web 
pages crawled by Google (http://www.google.com) in the first half of 1998. 
For any given search string, webbase returns results ranked by a combination 
of their text match score and their PageRank [11], which is based on the links to 
the document. Webbase also provides link information for each page. 

3.2 Outline of the Application 

1. Query webbase for docids corresponding to a particular search string. 

2. Identify all SCCs within the search results. 

3. Identify the largest SCC. 

4. Calculate L from each node in the largest SCC to find the best center. 

5. Form a minimum spanning tree via breadth first search from the best center 
(a graphical interface could guide the user down the tree). 

6. Compute C for the largest SCC. 
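
Steps 4 and 5 of the outline can be sketched as below. The link structure is assumed to be available as a dictionary from each page to the pages it links to, and only links that stay inside the SCC are followed. The average divides by the number of pages in the SCC, counting the center itself at distance 0; this is an assumption, though it appears consistent with the values in Table 1, which are multiples of 1/144 for the 144-page example. All names are illustrative.

```python
from collections import deque

def bfs_tree(succ, members, root):
    """Distances and BFS-tree parents from root, restricted to members."""
    dist = {root: 0}
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in succ.get(u, ()):
            if w in members and w not in dist:
                dist[w] = dist[u] + 1
                parent[w] = u
                queue.append(w)
    return dist, parent

def best_center(succ, members):
    """Return (average distance, center, BFS-tree parents) for an SCC."""
    members = set(members)
    best = None
    for v in sorted(members):
        dist, parent = bfs_tree(succ, members, v)
        avg = sum(dist.values()) / len(members)
        if best is None or avg < best[0]:
            best = (avg, v, parent)
    return best
```

The returned parent map encodes the tree of step 5: following parents from any page leads back to the center along a shortest path.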

3.3 Observations 

SCCs usually contain pages belonging to the same site, because pages within a 
site are more likely to be linked to one another than pages across sites. A pref- 
erence should be given to SCCs spanning the most sites, because links across 
sites are stronger recommendations than links within a single site. SCCs con- 
taining the same number of sites are ordered by the number of documents they 
contain. A large number of interconnected documents implies a degree of orga- 
nization and good coverage of the query terms. Rather than presenting a list of 
documents that contains many sequential entries from the same site, the search 
engine can present just the center from each SCC. By sorting the centers by 
the size of their SCCs, one can present the user with the maximum span of the 
search space with the minimum number of entries. Given a good starting point, 
the users can explore the SCCs on their own. 
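
The ordering described above (SCCs spanning the most sites first, ties broken by the number of documents) amounts to a two-part sort key. The sketch assumes that each SCC is a list of page URLs and that a site is identified by its hostname; both are assumptions made for illustration.

```python
from urllib.parse import urlsplit

def site_of(url):
    """Assumption: a 'site' is identified by the hostname of the URL."""
    return urlsplit(url).hostname

def rank_components(components):
    """Order SCCs by number of sites spanned, then by number of documents."""
    def key(comp):
        sites = {site_of(u) for u in comp}
        return (len(sites), len(comp))
    return sorted(components, key=key, reverse=True)
```

Presenting one center per component in this order gives the widest span of the result set in the fewest entries.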

Informal observation suggests that pages which do not belong to any large 
SCC tend to focus on a narrow topic, to have few outlinks, or to have few 
pages referencing them (which implies that they are probably not 
worth reading). When centers are sorted by the size of their SCCs, these docu- 
ments will be listed last. One further observes that the SCCs that span several 
sites tend to contain the main relevant pages, and are rich in “hubs”, or pages 
that contain links to many good pages. The algorithm will present these at the 
top of the list. 

3.4 An Example 

A good example of how extracting SCCs and finding good centers can expedite a 
search is the query for “java”. Webbase returned 311 documents. The largest 
SCC contained 144 pages ranging over 28 sites, with the vast majority of pages 
belonging to java.sun.com/. 

If we look at the top 10 centers (see Table 1) among the 144 pages in the 
largest SCC, we see that they are compilations of links to java resources on the 
Web. Such pages are obvious good starting points. Since they are contained in 
the SCC, they must be linked to by at least one other page, which means that 
this list has been evaluated and deemed useful by at least one other source. Pages 
which are compilations of links tend to be subdivided into topics, and have brief 
human evaluations or summaries of the pages linked to. If a search engine is able 
to present the user with such man made sources of links, it could potentially 
save itself the trouble of ranking, clustering, or otherwise evaluating documents 
matching the search string. 

It is interesting to note that this procedure might not allow the search engine 
to return the best resource immediately, but with high probability the best 
resource is only one step away. For example, java.sun.com is not in the top 5 
centers, but all 5 centers link to it. One of the top five documents returned by 
webbase, www.gamelan.com, does not appear in the SCC. It has many back links, 
but no forward links, and hence is disconnected from the other pages. Still, 4 
of the top 5 centers listed reference Gamelan, so that, again, this important 
site is just a click away from the center. What if, on the other hand, one were 
interested in just the best single page on java? Then one could look for the page 
that has a minimum average distance to it from any other page in the SCC. As 
Table 2 shows, java.sun.com comes out on top, as do other good “authorities” 
- pages which are good single sources of information. Rather than revealing good 
starting points to explore the topic, this approach brings the user directly to the 
most authoritative pages on the subject. Just as once it was said that all roads 
lead to Rome, so it can be said that all links lead to the most important pages. 
In summary, in the “java” search string example, the largest SCC is a group of 
high quality resources spanning several sites. Pages with lowest average distance 
to any other page within the largest SCC are good hubs, i.e. pages with many 
links to good sources. Pages with lowest average distance to them from any other 
page in the largest SCC tend to be good sources of information, or “authorities” 
on java. 
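
Finding the pages with the lowest average distance to them is the mirror image of finding centers: the breadth first search runs over reversed links. The sketch below assumes the same page-to-outlinks dictionary as input and again divides by the size of the SCC; names are illustrative.

```python
from collections import deque

def average_distance_to(succ, members, target):
    """Average number of links needed to reach target from pages in the SCC."""
    members = set(members)
    pred = {v: [] for v in members}        # reversed links inside the SCC
    for u in members:
        for w in succ.get(u, ()):
            if w in members:
                pred[w].append(u)
    dist = {target: 0}
    queue = deque([target])
    while queue:
        v = queue.popleft()
        for u in pred[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return sum(dist.values()) / len(members)

def rank_attractors(succ, members):
    """Pages ordered from most to least easily reached."""
    return sorted(members,
                  key=lambda v: (average_distance_to(succ, members, v), v))
```

The first entries of the ranking play the role of “authorities”, while the centers computed over forward links play the role of hubs.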

Future work will include a comparison with the Kleinberg algorithm [8] which 
also finds “hubs” and “authorities” using the link structure of the Web. 



4 Marketing Applications 

The small world topology can facilitate navigation and extraction of useful in- 
formation, but it can also give clues to the structure of communities from the 
link structure of the documents they create. The Web today represents a wide 
range of human interests. Some sites are devoted entirely to a single interest 
or cause. Others, such as Yahoo, have clubs or chat rooms where people can 
meet and share their ideas on particular topics. Many people document their 
interests and affiliations in their personal home pages. Therefore exploring the 
link structure of documents which belong to a particular topic could reveal the 
underlying relationship between people and organizations. 

Table 1. Top 10 centers for the largest SCC for the search string “java” 

Av. Min. Dist.  URL & Title 

2.47222         http://www.infospheres.caltech.edu/resources/java.html 
                Infospheres - Java Resources 
2.48611         http://www.apl.jhu.edu/hall/java/ 
                Java Programming Resources: Java, Java and More Java. 
2.70138         http://sunsite.unc.edu/javafaq/links.html 
                Java Links 
2.73611         http://www.cat.syr.edu/3Si/Java/Links.html 
                3Si - Java Resources 
2.77777         http://www.december.com/works/java/info.html 
                Presenting Java: Information Sources 
2.79861         http://www.javaworld.com/javaworld/common/jw-jumps.html 
                Java World - Java Jumps 
2.93055         http://java.sun.com/aboutJava/jug/ 
                Java(TM) User’s Groups Info Page 
3.01388         http://javaboutique.internet.com/javafaqs.html 
                The Java(TM) Boutique: Java FAQs 
3.04166         http://sunsite.unc.edu/javafaq/ 
                Cafe au Lait Java FAQs, News, and Resources 
3.14583         http://java.sun.com/ 
                The Source for Java(TM) Technology 

Table 2. Top 10 attractors of the largest SCC for the search string “java” 

Av. Min. Dist.  URL & Title 

1.90972         http://java.sun.com/ 
                The Source for Java(TM) Technology 
2.29861         http://java.sun.com/products/ 
                Products & APIs 
2.3125          http://java.sun.com/applets/ 
                Applets 
2.33333         http://java.sun.com/nav/used/ 
                Java(TM) Technology in the Real World 
2.34722         http://java.sun.com/docs/ 
                Documentation 
2.35416         http://java.sun.com/nav/developer/ 
                For Developers 
2.53472         http://java.sun.com:81/ 
                Java Home Page 
2.63888         http://java.sun.com/sfaq/ 
                Frequently Asked Questions 
2.66666         http://java.sun.com/javaone/ 
                JavaOne Home 
2.74306         http://java.sun.com/products/activator/ 
                Java Development 

To see what insight one could gain from identifying strongly connected com- 
ponents and average shortest paths, three search strings were issued to the search 
engine application outlined above: “abortion - pro choice”¹, “abortion - pro 
life”², and “UFO”³. 

Although the pro choice results contained several sites devoted entirely to 
the issue, such as www.choice.org, www.cais.com, www.abortion.com, and 
www.prochoice.com, these sites did not appear to be linked to one another 
(i.e. there was no strongly connected component containing pages from more 
than one site). In fact, the largest strongly connected component was a group of 
pro life pages which had mentioned “pro choice” in their content. 

On the other hand, the pro life query results had a pro life strongly connected 
component of 41 pages, which spanned 16 sites. One could conclude that pro lifers 
not only have a stronger Web presence (804 vs. 645 documents returned for the 
two queries), but that the pro life community is more tightly knit, and possibly 
better organized. 

The results of the UFO query contained a largest connected component of 
95 pages, spanning 21 sites. Apparently there is a lot of interest in UFOs and 
UFO enthusiasts are interested in each other’s sightings and speculations. 

The largest strongly connected components for all three queries had a high 
clustering coefficient and a small average shortest path, showing that groups 
of people with common interests are linked to one another via a small world 
network on the Web. 

How does this tie into marketing? Suppose one were interested in informing 
others of upcoming legislation regarding abortion. For example, a while back 
one could have either opposed or supported the partial birth abortion ban bill 
and wanted to start a red ribbon campaign. A ribbon placed on any site acts 
as a link to the main campaign site. The main site provides information about 
the campaign and allows others to download and include ribbons in their own 
sites. One could place one red ribbon in support of the bill in the middle of the 
pro life strongly connected sites and expect the ribbon to find its way to other 
pro life sites. On the other hand, if one wanted to start a black ribbon campaign 
in opposition to the bill, one would have to drop a black ribbon at several pro 
choice sites, because one would not expect the ribbon to propagate on its own. 
In general, one could reach a large community of people by placing an ad on a 
central page of an SCC. If the community is represented on the Web by many 
small SCCs, the advertiser would need to place ads in many SCCs, in order to 
ensure reaching as much of the target audience as possible. 

¹ Data can be viewed at http://www.stanford.edu/~ladamic/data/pccenters.htm. 
² Data can be viewed at http://www.stanford.edu/~ladamic/data/plcenters.htm. 
³ Data can be viewed at http://www.stanford.edu/~ladamic/data/ufocenters.htm. 

5 Conclusions 

I have shown that the largest strongly connected component of the graph of sites 
on the Web is a small world. The graph of all sites and of the .edu subset has 
an average minimum distance between nodes that is close to that of a random 
graph with the same number of nodes and edges. At the same time both sets of 
sites are highly clustered. These two properties make the Web a small world, at 
least at the site level. I have developed a prototype of a search engine application 
that can take advantage of the small world networks present in documents cor- 
responding to particular queries. In the example of the “java” search string, the 
application could present the user with documents which are good starting points 
for exploring a maximum number of quality sites within a minimum distance, 
or it could cut to the chase and return the highest quality documents directly. 
Finally, I have used the application to draw inferences about the connectedness 
of several communities of people that are represented on the Web and how it 
could influence advertising strategy. 



Abstract. We present the Smart Object, Dumb Archive (SODA) model for 
digital libraries (DLs). The SODA model transfers functionality traditionally 
associated with archives to the archived objects themselves. We are exploiting 
this shift of responsibility to facilitate other DL goals, such as interoperability, 
object intelligence and mobility, and heterogeneity. Objects in a SODA DL 
negotiate presentation of content and handle their own terms and conditions. In 
this paper we present implementations of our smart objects, buckets, and our 
dumb archive (DA). We discuss the status of buckets and DA and how they are 
used in a variety of DL projects. 



1 Introduction 

The Smart Object, Dumb Archive (SODA) model for digital libraries (DLs) was 
developed within the context of NCSTRL+ [15], the joint NASA Langley Research 
Center and Old Dominion University extension of the Networked Computer Science 
Technical Report Library (NCSTRL) [1]. The premise of the SODA model is a 
separation of responsibilities: associating with digital libraries such traditional value- 
added services as indexing and searching; with digital objects their individual 
properties as distinguished from those of a collection; and with archives guaranteed 
access over a period of time. To this end we have developed buckets, data objects 
tailored for DL use, that enforce their own terms and conditions, negotiate 
presentation of content, and maintain their own metadata. Many of the functions 
traditionally associated with archives have been “pushed down” into the buckets 
themselves, resulting in “smarter” objects and “dumber” archives. 

Buckets are thus a special class of digital objects that are aggregative and intelligent 
agents. Buckets are DL-protocol independent, and due to their self-contained nature, 
can exist outside of DLs altogether. Buckets provide the mechanism (not the policy) 
for grouping related information objects into a single logical entity for archiving. We 
are designing buckets to contain intelligent agents so they can communicate with each 
other, people, and arbitrary network services as well as perform computational and 
self-modifying tasks. Buckets are described further in [14]. 

Archives in the SODA model correspond to the lowest level - core archives - of the 
Kahn/Wilensky Framework (KWF) [2]. In that spirit we believe there should be only 
very limited functionality associated with an archive. Specifically, the functions an 
archive should support are those of add, delete, retrieve, list all objects and set/get 
metadata about the archive. 
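
The small set of archive functions listed above can be pictured as the following interface. This is only an in-memory sketch to show how little an archive needs to do; the class name, method names, and storage choices are hypothetical and are not the DA implementation.

```python
class DumbArchive:
    """In-memory stand-in for a dumb archive (all names are hypothetical)."""

    def __init__(self):
        self._objects = {}    # object id -> stored object, e.g. a bucket
        self._metadata = {}   # metadata about the archive itself

    def add(self, object_id, obj):
        if object_id in self._objects:
            raise KeyError(f"object already archived: {object_id}")
        self._objects[object_id] = obj

    def delete(self, object_id):
        del self._objects[object_id]      # raises KeyError if unknown

    def retrieve(self, object_id):
        return self._objects[object_id]   # raises KeyError if unknown

    def list_all(self):
        return sorted(self._objects)

    def set_metadata(self, key, value):
        self._metadata[key] = value

    def get_metadata(self, key):
        return self._metadata.get(key)
```

Everything else, such as presentation and terms and conditions, is left to the stored objects, and searching is left to the digital library services.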



1.1 Terminology 

Since there is no consensus on DL terminology, we use the following definitions for 
this discussion: 

• digital library services (DLS) - the “user” functionality and interface: searching, browsing, 
usage analysis, citation analysis, selective dissemination of information (SDI), etc. 

• archive - managed sets of data objects. DLs can poll archives to learn of newly 
published data objects, for example. 

• data object - the stored and trafficked digital content. These can be simple files 
(e.g., PDF or PS files) or more sophisticated objects such as buckets. 

Figure 1 illustrates the hierarchical nature of DLs, archives, and buckets. A DLS is 
shown as a single entity, but this logical entity could be a distributed set of servers. Note 
that although users can communicate with archives, we envision that archives will 
largely function only as middleware - enabling a DLS to locate buckets. Users will 
find buckets through a DL interface, and once found they will interact with the 
buckets themselves. 

Other DL models are possible (Table 1). The Smart Objects, Smart Archives (SOSA) 
model is possible, even likely to be the "default" DL of the future. However, to 
highlight the functionalities of buckets, we introduce them in the SODA context. Note 
that the Dumb Object, Smart Archive (DOSA) model describes the state of most 
current DLs, and the Dumb Object, Dumb Archive (DODA) model is an accurate 
description of early DLs. 



Table 1. Archive Design Space 

                 Smart Archives             Dumb Archives 

Smart Objects    SOSA                       SODA 
                 DL Example: none known     DL Example: NCSTRL+ 

Dumb Objects     DOSA                       DODA 
                 DL Example: NCSTRL         DL Example: an anonymous 
                                            FTP server with .ps.Z files 



1.2 Motivation 

We have been involved with a number of high traffic production NASA DLs since 
1994, including the Langley Technical Report Server [17], the NASA Technical 
Report Server [16], and the NACA Report Server [12]. One thing we have observed 
from http log files is that a surprising number of people do not find the NASA and NACA 
publications via the NASA and NACA DLs. Since the full contents of the NASA DLs 
are browsable, both the abstract lists and the reports are indexed by web crawlers, 
spiders and the like. Users are formulating complex queries to services such as Yahoo, 
Altavista, Lycos, Infoseek, etc. We presume this is indicative of the resource 
discovery problem: people start there because they do not know all the various DLs 
themselves; and the meta-searching problem: they are trusting these services to search 
many sources, not just the holdings of a single DL. 



Fig. 1. Access in the DL Hierarchy (levels shown: Library Users, Digital Library 
Services, Digital Library Service Providers) 

Although we believe we have built attractive and useful interfaces for the NASA DLs, 
our main concern is that people have access to NASA content and not that they use a 
specific DL interface. It is desirable that NASA publications are indexed by many 
services. Since there are several paths to the information object, the information 
object must be a first class network citizen, handling presentation, terms and 
conditions, and not depending on archive functionality. Buckets implement the object 
as a first class citizen idea, and both buckets and the DA software facilitate greater 
dissemination of the material by making it easier for the holdings to be found and 
indexed by third party services. 



1.3 Background 

The NCSTRL+ project is based on the creation of buckets and the extension of the 
Dienst [8] protocol. Dienst is a collection of DL services that receive messages 
encoded and transmitted via hypertext transfer protocol (http). Objects in Dienst are 
stored in directories, and are accessed through the Repository service. Metadata for 
the objects are stored in RFC-1807 format [9]. In addition to changing Dienst to 
properly handle buckets, we have added a new verb, Recluster, to the User Interface 
Service to assist in dynamically changing the display of search results. 



2 Buckets: Smart Objects 

Buckets are object-oriented container constructs in which logically grouped items can 
be collected, stored, and transported as a single unit. For example, a typical research 
project at NASA Langley Research Center produces information tuples: raw data, 
reduced data, manuscripts, notes, software, images, video, etc. Normally, only the 
report part of this information tuple is officially published and tracked. The report 
might reference on-line resources, or even include a CD-ROM, but these items are 
likely to be lost or degrade over time. Some portions, such as software, can go into 
separate archives (i.e., COSMIC or the Langley Software Server) but this leaves the 
researcher to re-integrate the information tuple by selecting pieces from multiple 
archives. Most often, the software and other items, such as datasets, are simply 
discarded. After 10 years, the manuscript is almost surely the only surviving artifact 
of the information tuple. 

Large archives could have buckets with many different functionalities. Not all bucket 
types or applications are known at this time. However, we can describe a generalized 
bucket as containing many formats of the same data item (PS, Word, Framemaker, 
etc.) but more importantly, it can also contain collections of related non-traditional 
STI materials (manuscripts, software, datasets, etc.). Thus, buckets allow the digital 
library to address the long standing problem of ignoring software and other supportive 
material in favor of archiving only the manuscript [21] by providing a common 
mechanism to keep related STI products together. The current semantics of buckets 
include a two-level structure: "elements", which are the unit of storage in buckets, and 
"packages", which are groups of elements. Figure 2 illustrates a typical bucket in a 
NASA DL application. 
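
The two-level structure of elements and packages can be pictured with a small data model. This is a hedged illustration only: the actual prototypes are Perl and keep their metadata in RFC-1807 format, whereas the class names, fields, and the plain dictionary used for metadata below are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    """Unit of storage in a bucket, e.g. one file in one format."""
    name: str
    media_type: str
    data: bytes = b""

@dataclass
class Package:
    """Named group of elements, e.g. all formats of the report."""
    name: str
    elements: list = field(default_factory=list)

@dataclass
class Bucket:
    """Container keeping related products together as one logical entity."""
    bucket_id: str
    metadata: dict = field(default_factory=dict)   # stand-in for RFC-1807 fields
    packages: dict = field(default_factory=dict)   # package name -> Package

    def add_element(self, package_name, element):
        package = self.packages.setdefault(package_name, Package(package_name))
        package.elements.append(element)

    def list_elements(self):
        return [(p.name, e.name)
                for p in self.packages.values() for e in p.elements]
```

A bucket for a typical project might hold a report package with PostScript and PDF elements next to software and dataset packages, so that the pieces are archived and retrieved together.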

Our bucket prototypes are written in Perl 5, and make use of the fact that Dienst uses 
http as a transport protocol. As in Dienst, bucket metadata is stored in RFC-1807 
format, and package and element information is stored in newly defined optional and 
repeatable fields. Dienst has all of a document’s files gathered into a single Unix 
directory. A bucket follows the same model and has all relevant files collected 
together using directories from file system semantics. However, this is 
implementation specific. The bucket API defines all operations on buckets. The 
bucket is accessible through a Common Gateway Interface (CGI) script that parses the 
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messages and enforces terms and conditions, and negotiates presentation to the 
WWW client. The bucket presentation format is currently encoded within the bucket, 
but we are currently planning to model presentation requirements using the Resource 
Description Framework (RDF) [10] to provide a mechanism for providing dynamic 
presentation templates that can exploit known semantics during presentation. 
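
To make the metadata format concrete, the sketch below reads an RFC-1807 style record, in which each field is written as `FIELD:: value` and fields may repeat. It is a minimal Python illustration, not the prototype's parser. The field names `X-PACKAGE` and `X-ELEMENT` are hypothetical stand-ins for the newly defined package and element fields, whose actual names are not given here, and the sample record is invented.

```python
from collections import defaultdict
from typing import Dict, List


def parse_rfc1807(text: str) -> Dict[str, List[str]]:
    """Collect 'FIELD:: value' lines into a mapping of field -> list of values.

    Repeated fields accumulate in order; a line without a 'FIELD::' prefix is
    treated as a continuation of the previous value.
    """
    record = defaultdict(list)
    current = None
    for line in text.splitlines():
        head, sep, tail = line.partition("::")
        tag = head.strip()
        if sep and tag and " " not in tag:
            current = tag.upper()
            record[current].append(tail.strip())
        elif current is not None and line.strip():
            record[current][-1] += " " + line.strip()
    return dict(record)


# Invented sample record; X-PACKAGE and X-ELEMENT are hypothetical field names.
SAMPLE = """\
BIB-VERSION:: CS-TR-v2.1
ID:: EXAMPLE//TR-0001
ENTRY:: January 1, 1999
TITLE:: A sample report title that
  continues on a second line
AUTHOR:: Doe, Jane
AUTHOR:: Roe, Richard
X-PACKAGE:: report
X-ELEMENT:: report.ps
X-ELEMENT:: report.pdf
END:: EXAMPLE//TR-0001
"""

record = parse_rfc1807(SAMPLE)
```

Because every field maps to a list, repeatable fields such as `AUTHOR` or the hypothetical `X-ELEMENT` need no special handling.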

The philosophy of Dienst is to minimize the dependency on HTTP. Except for the 
User Interface service, Dienst does not make specific assumptions about the existence 
of HTTP or the Hypertext Markup Language (HTML). However, Dienst does make 
very explicit assumptions about what constitutes a document and its related data 
formats. Built into the protocol are the definitions of PostScript, ASCII text, inline 
images, scanned images, etc. We feel that tightly coupling the DL protocol with 
knowledge of individual file formats reduces the flexibility of the DL protocol, 
making it less adaptable to new or locally defined data types and data relations. 




Fig. 2. A Typical NASA DL Bucket 

We favor making Dienst less knowledgeable about dynamic topics such as file format, 
and making such knowledge the responsibility of buckets. In NCSTRL+, Dienst is 
used as an index, search, and retrieval protocol. When the user selects an entry from 
the search results, Dienst would normally have the local User Interface service use the 
Describe verb to peer into the contents of the document's directory (including the 
metadata file), and Dienst itself would control how the contents are presented to the 
user. In NCSTRL+, the final step of examining the directory structure is skipped, and 
NCSTRL+ issues a URL redirect to the bucket. At this point, the user is 
communicating with the bucket and not Dienst or any other archive. The default 
method for an index.cgi is the display method, so the user notices little difference in 
operation between NCSTRL and NCSTRL+. 
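
The way a bucket answers a request can be sketched as a small dispatcher: read the `method` argument from the URL, fall back to `display` when none is given, check terms and conditions, and call the handler. The Python below is an assumption-laden illustration rather than the Perl CGI script itself; the handler bodies, the `TERMS` table, the handle value, and the numeric status codes are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple
from urllib.parse import parse_qs, urlsplit

DEFAULT_METHOD = "display"  # the default method when no argument is supplied

Args = Dict[str, List[str]]


def display(args: Args) -> str:
    return "<html>bucket contents</html>"  # placeholder presentation


def list_methods(args: Args) -> str:
    return "\n".join(sorted(METHODS))


def bucket_id(args: Args) -> str:
    return "example/0001"  # hypothetical handle


METHODS: Dict[str, Callable[[Args], str]] = {
    "display": display,
    "list_methods": list_methods,
    "id": bucket_id,
}

# Hypothetical terms and conditions: method -> principals allowed to call it.
# A method that is absent from this table is open to everyone.
TERMS: Dict[str, Set[str]] = {"list_methods": {"maly"}}


def dispatch(url: str, principal: Optional[str] = None) -> Tuple[int, str]:
    """Return an (HTTP-like status, body) pair for a bucket request."""
    query = parse_qs(urlsplit(url).query)
    method = query.get("method", [DEFAULT_METHOD])[0]
    handler = METHODS.get(method)
    if handler is None:
        return 404, f"unknown method: {method}"
    allowed = TERMS.get(method)
    if allowed is not None and principal not in allowed:
        return 403, f"method restricted: {method}"
    return 200, handler(query)
```

Calling `dispatch("http://dlib.cs.odu.edu/bucket/")` takes the default path, which is why a user redirected to a bucket sees an ordinary page without naming any method.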

The full list of bucket methods is discussed in [14]; Table 2 lists some common 
methods defined on a particular test bucket. Embedding this archive-like 
functionality in the buckets comes at the expense of additional storage overhead, 
approximately 80 KB per bucket. However, we consider this trivial in comparison to 
the size of typical NASA buckets (often several MB), and with respect to the 
additional functionality of object intelligence, object heterogeneity, DL protocol 
independence, and object mobility. 

Table 2. Some Sample Bucket Methods 

Methods and Arguments 
    Description 

http://dlib.cs.odu.edu/bucket/?method=display 
or 
http://dlib.cs.odu.edu/bucket/ 
    Displays the bucket's contents in HTML. The default method. 

http://dlib.cs.odu.edu/bucket/?method=list_principals 
(this bucket's appendices are restricted to "maly" / "maly") 
    Lists the principals (entries in the password file). Access can be 
    restricted to specific principals. 

http://dlib.cs.odu.edu/bucket/?method=list_methods 
    Lists all the methods known by this bucket. 

http://dlib.cs.odu.edu/bucket/?method=list_source&target=display 
    Lists the source code for the "display" method. 

http://dlib.cs.odu.edu/bucket/?method=list_tc&target=display.tc 
    Lists the terms and conditions for the "display" method. 

http://dlib.cs.odu.edu/bucket/?method=list_logs 
    Lists the names of all logs kept by the bucket. 

http://dlib.cs.odu.edu/bucket/?method=get_log&log=access.log 
    Displays the access log. 

http://dlib.cs.odu.edu/bucket/?method=id 
    Displays the bucket's handle. 

http://dlib.cs.odu.edu/bucket/?method=metadata 
    Returns the bucket metadata in RFC-1807 format. 



3 DA: Dumb Archives 

The use of buckets or other smart objects does not necessitate the use of dumb 
archives; it is possible to use buckets in a number of DL and WWW systems. 
However, we are implementing DA (dumb archive) as a reference implementation 
demonstrating the low level of functionality required for use in the SODA model. 
Table 3 lists the basic methods defined for DA. 

The DA is essentially a set manager; notice that the DA has no search capabilities. The 
DA's purpose is to provide DLs with the locations of buckets (the DLs can poll the 
buckets themselves for their metadata), and the DLs build their own indexes. If a bucket 
does not "want" to share its metadata (or contents) with certain DLs or users, its terms 
and conditions will prevent this from occurring. For example, we expect the NASA 
digital publishing model to begin with technical publications, after passing through 
their respective internal quality control, being placed in a NASA archive. The NASA 
DL would poll this archive to learn the locations of buckets published within the last 
week. The NASA DL could then contact those buckets, requesting their metadata. 
Other DLs could index NASA holdings in a similar way: polling the NASA archive 
and contacting the appropriate buckets. The buckets would still be stored at NASA, 
but they could be indexed by any number of DLs, each with possibly novel and 
unique methods for searching or browsing. Or perhaps the DL collects all the 
metadata, then performs additional filtering to determine applicability for inclusion 
in its own holdings. In addition to an archive's holdings being represented in many DLs, a 
DL could contain the holdings of many archives. If we view all digitally available 
publications as a universal corpus, then this corpus could be represented in N archives 
and M DLs, with each DL customized in function and holdings to the needs of its user 
base. Figure 3 illustrates this publishing model. 
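
The polling pattern described above can be sketched as a short harvesting loop: ask the archive for bucket locations, ask each new bucket for its metadata, and index whatever the bucket is willing to share. This Python sketch rests on assumptions: the two callables stand in for network requests to the archive and to the buckets, a return value of `None` models a bucket whose terms and conditions refuse the request, and all names and sample data are hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional


def harvest(
    list_buckets: Callable[[], Iterable[str]],
    fetch_metadata: Callable[[str], Optional[str]],
    index: Dict[str, str],
) -> int:
    """Add metadata for buckets not yet indexed; return how many were added."""
    added = 0
    for location in list_buckets():
        if location in index:
            continue  # already indexed on an earlier poll
        metadata = fetch_metadata(location)
        if metadata is None:
            continue  # the bucket declined to share its metadata
        index[location] = metadata
        added += 1
    return added


# In-memory stand-ins for an archive listing and for the buckets' replies.
ARCHIVE = ["bucket/0001", "bucket/0002", "bucket/0003"]
METADATA = {
    "bucket/0001": "TITLE:: First report",
    "bucket/0003": "TITLE:: Third report",
}

dl_index: Dict[str, str] = {}
first_pass = harvest(lambda: ARCHIVE, METADATA.get, dl_index)
```

Each DL would hold its own `index`, so several DLs can run the same loop against one archive and still differ in what they keep and how they search it.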



Table 3. Methods for a Dumb Archive 

Method    Description 

put       insert a data object into the archive 
delete    remove a data object from the archive 
list      display the holdings of the archive 
info      display metadata about the archive 
get       redirects to the object's URL or URN 
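
As a rough illustration of how little a dumb archive needs to do, the five methods of Table 3 fit in a few lines. The Python class below is a hypothetical in-memory sketch, not the DA reference implementation; the constructor argument, the shape of the `info` result, and the example handles and URLs are assumptions.

```python
class DumbArchive:
    """A set manager: it knows where objects live and nothing about their contents."""

    def __init__(self, name):
        self.name = name
        self._holdings = {}  # handle -> URL or URN of the object

    def put(self, handle, location):
        """Insert a data object (by reference) into the archive."""
        self._holdings[handle] = location

    def delete(self, handle):
        """Remove a data object from the archive; unknown handles are ignored."""
        self._holdings.pop(handle, None)

    def list(self):
        """Return the holdings of the archive as a sorted list of handles."""
        return sorted(self._holdings)

    def info(self):
        """Return metadata about the archive itself."""
        return {"name": self.name, "holdings": len(self._holdings)}

    def get(self, handle):
        """Return the location a client should be redirected to."""
        return self._holdings[handle]


archive = DumbArchive("test")
archive.put("b/2", "http://example.org/b2/")
archive.put("b/1", "http://example.org/b1/")
```

There is deliberately no search method and no access to object contents; both are left to DLs and to the buckets themselves.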



4 Discussion 

Although buckets and SODA were initially implemented using modified versions of 
Dienst, it should be stressed that neither buckets nor SODA require Dienst to operate. 
Indeed, a bucket design goal is to provide sophisticated digital objects for DLs that do 
not require Dienst, or any other specific DL protocol, to be used. In our internal 
applications, we regularly use buckets without Dienst. To applications that know to 
exploit them, buckets offer much functionality. To applications that are not bucket 
aware, buckets appear as regular HTML pages. For example, it would be easy to 
build a DL using a webcrawler search engine (such as Excite for Web Servers or 
Ultraseek Server). This would not be easily accomplished for data existing only 
within a Dienst archive. 

There may be situations in which buckets are unnecessary. For large homogeneous 
collections, the storage overhead and additional administrative work of managing both 
buckets and archives may be undesirable. For a DL that may never be more than a 
DOSA or DODA DL, buckets are probably unnecessary. However, buckets are 
motivated by our own production DL experiences, in which latent or creeping 
requirements demanded that dumb objects eventually become smart. 

Similarly, the motivation for SODA comes from our negative experiences in 
transitioning from one DL system to the next, and in having the same body of content 
indexed by multiple systems. We believe that SODA builds the foundation for object 
mobility, object-level heterogeneity and DL protocol level heterogeneity. We intend 
to test these goals when we transition buckets and SODA from our prototypes to 
production NASA DLs. 




Fig. 3. The SODA Publishing Model 



5 Status 

The NCSTRL+ DL interface is based on our extensions to the Dienst protocol to 
provide a testbed for experimentation with buckets, clusters, and interoperability. 
"Clustering" is an advanced searching and browsing capability that allows dynamic 
clustering of holdings based on subject, institution, archival type and terms and 
conditions. The NCSTRL+ interface can be accessed at: 
http://dlib.cs.odu.edu/ncstrlplustool.html 

Our long-term plans call for the conversion of the NASA DLs to buckets and 
NCSTRL+. At this point, the NCSTRL+ project is converting the more than 1800 items in 
the Langley Technical Report Server (LTRS) to buckets. The lessons learned in the 
LTRS conversion are described in [13]. The buckets created in that conversion process 
were stored in a dumb, stand-alone archive, which was then indexed into NCSTRL+. 
Therefore, having the handle for a bucket allows a user to retrieve the bucket from the 
archive; alternatively, the user can search NCSTRL+ to find a bucket. In either 
case the bucket itself will handle the presentation. 

Additionally, we have developed a set of tools to aid in the creation, tracking and 
management of buckets and enforcing publishing and maintenance policies for 
archives. The tools have been used to create a testbed for NCSTRL+ which, at this 
time, runs on three NCSTRL+ servers with index service for five archives. Since 
NCSTRL+ can access other Dienst collections, we can extend searches to all of 
NCSTRL, CoRR, and D-Lib Magazine as well. 

Other active bucket development areas include the creation of "light-weight buckets" 
that provide an author-specified subset of functionality to save on storage overhead, 
and the creation of XML-specified bucket ontologies. These will shield users from 
the two-level constructs of packages and elements and allow the storage and 
interaction of arbitrarily complex hierarchies that represent real world objects (e.g., 
"assignments" within a "university class" bucket). 



6 Related Work 

There is extensive research in the area of redefining the concept of "document" or 
providing container constructs. In this section we examine some of the projects and 
technologies that are similar to buckets, as well as projects that have capabilities 
similar to DA. Although the use of buckets as intelligent agents is not described in this 
paper, we also note that we are unaware of other attempts to make archival entities 
intelligent. 



6.1 Bucket-Like Projects 

Buckets are most similar to the digital objects first described in the Kahn/Wilensky 
Framework (KWF) [2], and its derivatives such as the Warwick Framework containers [6] 
and the more recent Flexible and Extensible Digital Object Repository Architecture 
(FEDORA) [18]. In FEDORA, DigitalObjects are containers, which aggregate one or 
more DataStreams. DataStreams are accessed through an Interface, and an Interface 
may in turn be protected by an Enforcer. The significant design difference between 
the KWF derivatives and buckets is that KWF DigitalObjects are tightly tied to the 
archive and protocol that hold and serve them. 

Multivalent documents [19] appear similar to buckets at first glance. However, the 
focus of multivalent documents is more on expressing and managing the relationships 
of differing "semantic layers" of a document, including language translations, derived 
metadata, annotations, etc. There is not an explicit focus on the aggregation of several 
existing data types into a single container. 

E-commerce applications are producing a number of bucket-like projects. One 
example is IBM's Cryptolopes [4], which are designed to allow for unlimited 
distribution of data objects, but controlled access to their contents. Similarly, 
DigiBox [20] has been developed with the goal "to permit proprietors of digital 
information to have the same type and degree of control present in the paper world" 
[20]. As such, the focus of the DigiBox capabilities is heavily oriented toward 
cryptographic integrity of the contents, and not so much on the less stringent demands 
of the current average digital library. There appear to be no hooks to make either 
DigiBoxes or Cryptolopes intelligent agents. DigiBox and Cryptolope are commercial 
endeavors and are thus less suitable for our research purposes. 

To a lesser extent, buckets are not unlike some of the proposals from various 
experimental file systems and scientific data types. The Extensible File System (ELFS) 
[3] provides an abstract notion of "file" that includes aggregation, data format 
heterogeneity, and high performance capabilities (striping, pre-fetching, etc.). While 
ELFS is designed primarily for a non-DL application (i.e., high-performance 
computing), it is typical of an object-oriented approach to file systems, with generic 
access APIs hiding the implementation details from the programmer. 

The Hierarchical Data Format (HDF), along with related formats (netCDF, HDF-EOS, etc.), 
is a multi-object, aggregative data format that is variously: raw file storage, the 
low-level I/O routines to access the raw files, an API for higher level tools to access, 
and a suite of tools to manipulate and analyze the files [11, 22]. While HDF is mature and 
has an established user base, it is largely created by and for the earth and atmospheric 
sciences community, and this community's constraints limit the usefulness of HDF as 
a generalized DL application. It is worth noting, however, that buckets of HDF files 
are entirely possible and appropriate. 



6.2 DA-Like Projects 

DA is interesting because of what it leaves out, not what it implements. As the name 
implies, there are any number of more sophisticated archive-related projects and 
technologies. For example, the proposed Repository Access Protocol (RAP) [7] exposes 
many of the same operations as DA (VERIFY, DEPOSIT, DELETE, etc.), but it defines 
separate explicit ACCESS operations for both the digital object and its metadata. Such 
concepts in SODA have been removed from the DA and placed within the bucket 
itself. 

The Dienst protocol has some DA-like concepts as well. In particular, the Repository 
Service in Dienst implements a List-Contents verb, the LibMgt Service implements a 
Submit verb, etc. However, the main function of the Repository Service in Dienst is to 
regulate access to the items in the repository, through verbs such as Body and Page. 
Again, in SODA these functions are pushed down into the buckets. 

The Dienst research group has proposed a more recent service, the Collection 
Service [5]. This service is more like DA than the previous examples, in that its 
purpose is to group together arbitrary network objects based on some criteria. 
However, future plans for the collection service call for it to be involved in operations 
such as query routing, which are obviously beyond the scope of the DA. When the 
Collection Service is available for testing, it may be a good candidate to implement a 
SOSA model DL. 



7 Conclusions 

The SODA DL model was created for the NCSTRL+ project to facilitate DL 
interoperability and to increase the scope and nature of the availability of archived 
data objects. SODA shifts many functions associated with archives to the archived 
objects themselves. We have developed aggregative and intelligent archival entities, 
buckets, to handle this shift in responsibility. Buckets can exist in a number of DL 
archives, or outside archives altogether, and are responsible for presenting their 
contents and enforcing their own terms and conditions. To services that are not 
bucket-aware, buckets appear as ordinary HTML pages. The dumb archive we have 
developed, DA, provides just enough functionality to illustrate the role of an archive 
as the middle layer in the SODA DL hierarchy. SODA facilitates DL interoperability 
by clearly separating the roles of a DL, an archive, and the object itself. Finally, the 
SODA model facilitates wider dissemination of holdings by making it easy for third 
party services to find and index buckets. 
Abstract. The WWW has become the universal information space anticipated 
and dreamed of by its inventor, Tim Berners-Lee. To reach its full 
potential, the WWW must meet two major challenges: becoming universal 
and scaling up. Being universal means enabling everyone to access 
information and to publish on the WWW. The Web should therefore take 
into account the vast differences in culture, education, ability, and 
means, as well as the physical limitations, of users on every continent. 
Supporting scalability means that, even as millions of services become 
accessible on the Web, the infrastructure must be such that performance, 
security, and the relevance of information continue to increase. 
Abstract. Information technologies do not merely provide improvements 
to standard library services: they introduce a fundamental change in 
the way information is created, disseminated, and used. The shift from 
a centralized, discrete publishing model to a decentralized, continuous 
self-publishing model has already begun. However, some of the best 
aspects of the current model, such as peer review, may be compromised 
even as new services are offered. Effort will also be needed to provide 
appropriate services for data that are not textual by nature, such as 
images, video, or scientific data. 

Many tools and techniques can help to improve and exploit the vision 
of this emerging information infrastructure. One set of techniques 
concerns documents. Multivalent documents are a new kind of document 
that appears useful in this context. These documents are (i) highly 
open, that is, supporting a variety of formats and functions; (ii) highly 
extensible and adaptable in various ways to users' specific needs; and 
(iii) highly distributed, that is, composed of parts that may reside at 
different sites and be composed dynamically to form a coherent 
document. A particularly interesting aspect of the model is that it 
enables "spontaneous cooperation", by allowing anyone to annotate Web 
pages, scanned images, and other remote resources, for which no 
privileged rights are needed. 

While multivalent documents offer some solutions for manipulating 
online resources, finding these resources is still difficult, 
particularly for images. Automatic content analysis makes it possible 
to analyze the content of information objects in order to facilitate 
later access. We will present some recent developments in this area for 
the retrieval of images, photographs, and text. 



S. Abiteboul, A.-M. Vercoustre (Eds.): ECDL ’99, LNCS 1696, p. 468—468, 1999. 
Abstract. This paper summarizes the main findings of a University of 
California study of the Museum Educational Site Licensing (MESL) 
project, the first large-scale experiment in the United States in 
distributing images and metadata. The study examined the cost and 
social impact of distributing a large quantity of images and metadata 
from various museums to universities. One of the main findings is that 
the data distribution works well for individual use but poorly for 
collective use, for example in a classroom. The obstacles to wider 
adoption are: unclear content, the absence of tools essential to ease 
of use, and the lack of support for faculty using this new technology 
in their teaching. Other aspects also need improvement: the ability to 
enrich the data with images from other sources, allowing instructors 
to modify the descriptive information and comments attached to images, 
encouraging the creation of tools that add value to the collection, 
and providing user interfaces. The study also compared the costs of 
digital distribution with the costs of maintaining a slide library. 
Abstract. The rapid expansion of multimedia digital collections 
highlights the need to classify not only textual documents but also 
the non-textual components embedded in them. We propose a model for 
basing the classification of multimedia components on general features, 
and show how information from targeted neighboring text can be used to 
classify photographs effectively on a single feature, distinguishing 
indoor from outdoor images. We examine several variations of a 
TF*IDF-based approach, analyze their effects empirically, and evaluate 
our system on a large collection of images from current newsgroups. 
In addition, we examine other classification and evaluation methods, 
and the effects that a second feature can have on indoor/outdoor 
classification. We obtain a classification accuracy of 82%, which 
clearly exceeds baseline estimates and competing image-based 
approaches, and is comparable to the accuracy obtained by a human 
given comparable information. 
Abstract. This paper describes the production of a multimedia CD-ROM on 
rural housing in France and on how to restore it without losing its 
traditional character. The educational message is illustrated with a 
large number of photographs of unrenovated and renovated houses, along 
with explicit comments associated with the photos. The paper focuses 
on the metadata describing the photos and on their use in the 
automatic generation of the CD-ROM. We first describe our use of 
Dublin Core for the metadata, which we wanted to be interoperable and 
reusable. We extended Dublin Core with detailed XML descriptions used 
to generate the specific application. We then show how to generate the 
pages of the CD-ROM with our system, Norfolk, a virtual document 
generator, by defining HTML documents, or prescriptions, that contain 
queries on the XML metadata. This approach can be used in a wide 
variety of applications, from virtual photo albums to virtual museums 
on the Web. 
Abstract. In this article, we present a model of an audiovisual digital 
library developed by the Direction de la Recherche Prospective of INA. 
We describe how, if audiovisual material is granted heritage status, 
it is essential to provide users of audiovisual digital libraries not 
only with tools for searching images and sounds efficiently, but also 
with reading environments that allow these images and sounds to be 
interpreted as documents. We define documents as records of an 
editorial activity, and we show how library users use them in the 
course of active reading. This reading, which aims at writing a new 
document, is based on contextualizing the audiovisual document within 
a structured corpus of meta-information that we call documentation. 
This documentation consists of indexing data and of documents 
resulting from earlier readings (such as production files, the 
director's files, critics' articles, etc.). 

Consequently, we propose a model that allows users to read audiovisual 
documents not only alongside their documentation but from their 
documentation. This model is based on the concepts and techniques of 
electronic publishing. It defines several levels of control that 
ensure the semantic, descriptive, structural, and presentational 
consistency of the documentation. This consistency of the metadata 
makes it possible to set up an editorial chain that automatically 
generates hypermedia applications. We describe a prototype 
implementation of this model and show how such hypermedia applications 
can be used as reading environments to improve the understanding of 
the content of audiovisual documents by users of audiovisual digital 
libraries. 



INA, the Institut National de l'Audiovisuel, brings together the archive center of 
French public broadcasting and the audiovisual legal deposit. INA has archived 
television documents since 1949 and radio documents since 1929; its holdings contain 
more than 3 million documents, corresponding to about 400,000 hours of video 
programs and 500,000 hours of audio programs. 
Abstract. This paper describes an application that supports the 
assisted generation of Dublin Core-based metadata descriptions and of 
online digital visual summaries of videos. It is a Java application 
that integrates a video display window, with VCR-style controls, and 
metadata entry forms generated from a hierarchical RDF schema. The 
schema definition is also used to validate the descriptions entered by 
the user and to control the output format. The generated metadata 
descriptions can be saved as RDF, as HTML, or in a database. They can 
be used to enable metadata exchange, to search for data on the 
Internet, or to dynamically generate detailed visual summaries for 
video browsing. This prototype system was designed for the State 
Library of Queensland's (SLQ) Audiovisual Unit to allow fast, easy, 
and efficient generation of standardized metadata. This metadata can 
be used to create detailed online visual summaries of the latest video 
acquisitions. 
Abstract. Music is an important component of digital libraries. This 
paper concerns a digital music library that supports music retrieval, 
and proposes a method for extracting representative sequences. These 
are then used to offer shortened versions to users. The method 
comprises two steps: sequence extraction and syntactic classification 
of melody fragments. Sequence extraction is performed using heuristic 
rules. We conducted a study of the fidelity of sequence extraction on 
94 Japanese songs and obtained rates of 0.766 for recall and 0.786 for 
precision. Syntactic classification is based on probabilistic analysis 
of syntactic patterns, combining classification and parsing. The 
proposed method uses a decision tree and a finite-state automaton, and 
achieved an accuracy of 0.884 in extracting representative sequences. 
Abstract. With the aim of improving the effectiveness of search methods 
that return ranked results, we present an approach based on a 
graphical presentation of "views" of the results and on their 
manipulation by the user. A view is a subset of the returned documents 
that contains a particular subset of the query terms. This approach 
has been implemented in a system called VIEWER (VIEwing WEb Results), 
which serves as an interface to various search engines. An 
experimental evaluation of the performance of VIEWER compared with 
AltaVista is the main subject of the paper. We first show the results 
of an experiment on a single, simple type of search in which VIEWER, 
used as an automatic ranking system, gives results far superior to 
those of AltaVista. We then tackle a more realistic search scenario, 
with free queries, an unlimited number of selected results, and 
possibly query reformulation. The results of the experiment suggest 
that the VIEWER user, unlike the AltaVista user, focuses effort on 
evaluating results rather than on viewing them, for greater 
effectiveness and user satisfaction. In particular, we found that 
VIEWER users select half as many non-relevant documents as AltaVista 
users, while selecting as many relevant documents. 
Abstract. Techniques for automatic query expansion from the top 
retrieved documents have shown promise for improving retrieval 
effectiveness on large collections. However, these methods have not 
yet been evaluated and compared systematically. In this paper we focus 
on term-scoring methods based on the difference between the 
distribution of terms in the (pseudo-)relevant documents and in the 
whole collection. This approach is seen here as a complement or an 
alternative to more traditional approaches. We show that when such 
distribution-based methods are used to select expansion terms while 
staying within the classical Rocchio weighting schemes, the overall 
results are unlikely to improve. However, we also show that when these 
same distribution methods are used both to select and to weight the 
expansion terms, retrieval effectiveness can increase considerably. 
Given the variations in performance of the different methods across 
different queries, we believe that it is possible to combine the sets 
of ranked terms suggested by each of these methods to further improve 
average performance. The experimental results presented confirm this 
hypothesis and show that automatic query expansion can improve 
performance by 21.34% over unexpanded queries. We also discuss the 
effect that various parameters, such as query difficulty, the number 
of documents selected, or the number of terms selected, can have on 
retrieval effectiveness. 

Abstract. Resource discovery in a distributed digital library poses many
challenges; one of them is selecting search engines for query
distribution, given a query and a set of search engines. This paper
focuses on search engine performance as a criterion for search engine
selection and defines two measures of a search engine’s performance:
availability (will the engine respond within a time limit?) and response
time (how quickly will the engine respond, if it responds?). We predict
these two performance characteristics with several algorithms, all of
which require little evaluation time and keep a succinct record of past
performance for each search engine. We used operational data from the
NCSTRL distributed digital library to build and evaluate our predictors,
and we found that simple prediction methods performed as well as more
complex ones and that prediction accuracy was closely tied to the
consistency of the data.
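
As an illustration of what a cheap predictor with a succinct per-engine
record might look like, here is a minimal sketch based on exponential
smoothing. The abstract does not name the algorithms that were evaluated,
so the smoothing rule, the class `EnginePredictor`, the function
`select_engines` and every parameter value are assumptions.

```python
from dataclasses import dataclass

@dataclass
class EnginePredictor:
    """Per-engine record: two running estimates and a counter (hypothetical)."""
    smoothing: float = 0.3       # assumed weight of the newest observation
    availability: float = 1.0    # estimated chance of a reply within the time limit
    response_time: float = 0.0   # estimated seconds to reply, given a reply
    answered: int = 0            # replies seen so far

    def observe(self, elapsed):
        """Record one request; elapsed is None when no reply arrived in time."""
        replied = 1.0 if elapsed is not None else 0.0
        self.availability += self.smoothing * (replied - self.availability)
        if elapsed is None:
            return
        if self.answered == 0:
            self.response_time = elapsed           # first reply seeds the estimate
        else:
            self.response_time += self.smoothing * (elapsed - self.response_time)
        self.answered += 1

def select_engines(predictors, min_availability=0.6):
    """Keep engines likely to reply, fastest predicted engine first."""
    usable = [(p.response_time, name) for name, p in predictors.items()
              if p.availability >= min_availability]
    return [name for _, name in sorted(usable)]

fast = EnginePredictor(smoothing=0.5)
for elapsed in [2.0, None, 4.0]:
    fast.observe(elapsed)

flaky = EnginePredictor(smoothing=0.5)
for elapsed in [1.0, None, None]:
    flaky.observe(elapsed)

chosen = select_engines({"fast": fast, "flaky": flaky})
```

Each engine costs a constant amount of memory and a constant-time update,
which matches the low-overhead requirement stated in the abstract.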

Abstract. As today’s digital libraries become increasingly complex, the
possibilities they offer will grow, and so will the difficulty of
learning to use them. To create useful and user-friendly interactive
systems, designers need to ensure that good design criteria are built
into their systems, so as to take account of users’ needs and cultural
references. We carried out a study to establish the design criteria to
keep in mind when designing digital libraries. The study provides an
overview of the impact that using a digital library can have on
completing a task, and of what users think of the effectiveness of such
libraries. The results also suggest that interfaces pay little attention
to users’ navigation needs or to multicultural aspects. Consequently,
this paper also discusses guidelines for designing user-centred digital
libraries.

Abstract. The main objective of an information provider is to satisfy
users’ needs, that is, to deliver the right information at the right
time and in the right form to the user. A prerequisite for developing
personalized services is to rely on user profiles that represent users’
needs. In this paper, we first consider the problem of representing a
general user model. This model is then adapted to digital library users.

Abstract. The amount of textual information available is becoming so
large that new techniques are needed to assist users who want to access
it. These techniques are known as Information Access (IA). In this
paper, we propose a user-oriented system that summarizes the available
information. It helps users choose the document most relevant to their
search. A sentence extraction method is used to generate the summaries.
Each sentence is given a score, computed using heuristics that proved
very effective in previous work (keywords, title and location). To
represent users’ needs, we drew on the queries of an information access
system and on the query expansion techniques used with WordNet. We
present a systematic and objective evaluation method for measuring the
effectiveness of using summaries in two very common uses of an
information access system: ad hoc retrieval and relevance judgement of
retrieved documents. The results confirm our initial hypotheses, namely
that summaries are a very useful tool for assisting users in searching
textual information.
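
The sentence scoring described in this abstract combines keyword, title
and location heuristics. A minimal sketch of that idea follows; the equal
weights, the rule that only the first and last sentences earn the location
bonus, and the function names are assumptions, and the WordNet-based
expansion of the user's keywords is left out.

```python
import re

def tokenize(text):
    """Lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score_sentences(title, sentences, keywords, weights=(1.0, 1.0, 1.0)):
    """Return (score, position, sentence) for every sentence."""
    w_key, w_title, w_loc = weights
    title_terms = set(tokenize(title))
    key_terms = {k.lower() for k in keywords}
    scored = []
    for i, sentence in enumerate(sentences):
        terms = set(tokenize(sentence))
        keyword_hits = len(terms & key_terms)       # keyword heuristic
        title_hits = len(terms & title_terms)       # title heuristic
        # location heuristic (assumed): favour the first and last sentence
        location = 1.0 if i in (0, len(sentences) - 1) else 0.0
        score = w_key * keyword_hits + w_title * title_hits + w_loc * location
        scored.append((score, i, sentence))
    return scored

def summarize(title, sentences, keywords, n=2):
    """Extract the n best sentences and return them in document order."""
    scored = score_sentences(title, sentences, keywords)
    best = sorted(scored, key=lambda item: (-item[0], item[1]))[:n]
    return [sentence for _, _, sentence in sorted(best, key=lambda item: item[1])]

title = "Summaries for information access"
sentences = [
    "Summaries help users judge documents quickly.",
    "The weather was pleasant during the workshop.",
    "Retrieval experiments with summaries used ad hoc queries.",
    "Lunch was served at noon.",
]
summary = summarize(title, sentences, keywords=["summaries", "retrieval"])
```

Passing the user's query terms as `keywords` makes the summary
query-oriented, in the spirit of the user-directed summaries described
above.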

Abstract. Finding information on the Web is a problem that users face.
This article describes Pharos, a new service developed to help Web users
share their knowledge of it. Pharos relies on a collaborative
infrastructure that allows groups of users to catalogue and evaluate
documents on a given topic. These data, which may be subjective, are
synthesized in order to produce personalized recommendations. The
scalability of the system is ensured by distributing the servers and
replicating their databases. Pharos has been implemented in Java and is
currently under evaluation.

Abstract. We present a new approach to building RDF schemas by
exploiting existing ontologies and structured vocabularies (thesauri).
This approach is based on specifying inclusion relations between
thesaurus terms and ontology concepts. We show how these relations can
be exploited to generate RDF schemas that incorporate the structural
view of an ontology and the classification scheme provided by the
thesaurus.

Abstract. This paper presents a new method for enabling digital library
users to organize and annotate their documents. We have extended the
open hypermedia concept by introducing typed links, which support the
addition of (user-defined) semantics to hypertexts, navigation, and the
automatic analysis and synthesis of hypermedia structures. The Webvise
open hypermedia system is integrated with the WWW and has been augmented
with a type system. We illustrate its potential use in the digital
library context with a scenario in which teachers jointly prepare a
course based on digital library material.

Abstract. This article presents one aspect of the tools and methods for
multilingual information retrieval developed in the Twenty-One project.
The tools and methods are evaluated on the TREC multilingual collection,
using Dutch queries to retrieve English documents. The main question
concerns an evaluation of two approaches to disambiguation: should one
strive to find the correct translation of each query word before
starting any search, or should one search with several translations of
each query word? The experimental study suggests that the quality of the
retrieval methods matters more than the quality of the disambiguation
methods. Good retrieval methods are best able to disambiguate translated
queries implicitly during the search.

Keywords: multilingual information retrieval, statistical machine
translation.

Abstract. In this article, we describe a cross-language information
retrieval system that makes it possible to query multilingual databases.
The CEA has to handle a large number of multilingual documents and
databases. For this reason, we modified the information retrieval system
we use, SPIRIT, which relies on elaborate linguistic and statistical
processing. We have thus developed a cross-language querying system
based, on the one hand, on indexing documents that contain parts written
in different languages and, on the other hand, on bilingual query
reformulation. The latter provides all possible translations of the
significant words of the query, and the documents act as a filter when
there is uncertainty or ambiguity. The answers to a query are presented
as a list of document classes ranked by relevance. This article
describes the application of these techniques to cross-language querying
of catalogues and bibliographic databases.

Abstract. We propose a query expansion technique to improve the
retrieval, across several languages, of words or phrases in a document
(even if the query contains the words or phrases in a single language,
the system searches for their equivalents in several languages). This
form of retrieval is dictionary-based and is known as CLIR
(Cross-Language Information Retrieval). To resolve word translation
ambiguities, we use a technique based on word similarity. We proposed
this technique in earlier work, where it served to make the
dictionary-based translation of a query more accurate. The query
expansion technique is then used when translating queries, in order to
make retrieval more relevant. We show that combining the two techniques
is effective for queries expressed in three languages (German, Spanish
and Indonesian). These queries are used to search a collection of
English documents from TREC (Text Retrieval Conference). Our experiments
show that the more phrases the queries contain, the better the
techniques based on word similarity perform. Moreover, our results agree
with earlier research showing that phrase recognition and translation
are crucial to the effectiveness of CLIR.

Abstract. Digital libraries have attracted immense interest, and several
research projects are addressing major challenges in this field. While
intelligent computing systems have been used for specific related
problems, most projects use conventional techniques for the basic
structure of the library itself. The SOMLib project has created a
digital library system that uses a neural network as the core of the
system. A well-known unsupervised neural network model, the
self-organizing map, is used to structure a document collection by
topic, as in the organization of a real library. Based on this core,
further modules provide information retrieval functionality, integrate
distributed libraries, and automatically label the various sections of
the document collection. In addition, an interface based on a graphical
metaphor guides the user’s intuition by instantly providing an overview.

Abstract. This report describes the development of a European digital
library for grey literature. The aim was to provide scientists working
in the fields of information technology and applied mathematics with a
digital library that could also be used as a testbed for research
activities. This service, set up within NCSTRL (the US Networked
Computer Science Technical Reference Library), was developed from the
DIENST system used by NCSTRL, in order to adapt it to the needs of the
European scientific community. The report describes the additional
functionality that was added and discusses the difficulties encountered
in integrating it into a pre-existing architecture, protocol and system.

Abstract. This paper briefly describes both the technical and the
organizational problems that arose in setting up a digital library at
the University of Crete, available at http://dlib.libh.uoc.gr. Over the
past few years, we have described and analysed our approaches and
experiences in bringing into service a digital library built from
several collections. We needed to analyse the purpose of the library and
what users expected of it, to choose the appropriate software, to keep
the design of new functionality flexible, to adapt and extend the chosen
software to current needs, to install and configure the software, to
improve it by making good use of user feedback, and to interact with
document authors and librarians. The goal is to make the digital library
user-friendly, easy to use and to maintain, and also to support the
collection and digitization of the library’s data. The final system is
operated by the existing library staff.

The main technical problems relate to the design, implementation and
deployment of digital library functionality, such as a multilingual
interface and storage, generalizing the software to allow searching over
heterogeneous collections and to support the Z39.50 protocol, providing
tools that simplify the configuration and administration of the library
and the addition of data to it, as well as tools for creating or
modifying metadata and for recording data when new documents are
submitted.

Abstract. Z39.50 is a client/server protocol widely used in digital
libraries and museums for searching and retrieving information spread
across a number of heterogeneous sources. To overcome semantic and
schematic discrepancies among the various data sources, the protocol is
based on a view of the information world as a flat list of fields,
called Access Points (APs). The main problem in developing Z39.50
wrappers is mapping this unstructured list of APs onto the data of the
underlying source. Unfortunately, existing Z39.50 wrappers do not
provide high-level mapping languages with verifiable properties. In this
article, we propose a toolkit based on description logic that enables
the declarative specification of Z39.50 wrappers. We argue that
conceptualizing the AP mappings allows formal validation of the quality
of query translation and thus ensures the quality of the retrieved data.
It also makes it possible to tackle some open Z39.50 problems (for
example, metadata retrieval and query failures due to untranslated APs)
by enriching the generated Z39.50 wrappers with a number of services,
such as conceptual structuring of flat Z39.50 vocabularies and
intelligent assistance for Z39.50 query evaluation.



* This work was partially funded by the European project AQUARELLE
(Telematics Application Programme IE-2005) and the CIMI interoperability
testbed project.

** This work was carried out while the author was at ICS-FORTH.

Abstract. This paper motivates and defines a generic system for product
catalogues (online or offline). Based on a detailed requirements
specification, the data model is defined using an object-oriented design
notation, and the query language used to express customer interest is
defined using fuzzy set theory. The model provides the basis for
implementing a highly interactive generic catalogue management system
that is designed to interface with a relational database, search engines
or specific index structures.
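
To make the idea of expressing customer interest with fuzzy sets
concrete, here is a minimal sketch. The membership shapes, the use of
`min` as the fuzzy conjunction, the product attributes and all function
names are assumptions for illustration; the abstract does not describe
the actual query language.

```python
def about(target, tolerance):
    """Triangular membership: 1 at target, 0 beyond target +/- tolerance."""
    def membership(value):
        return max(0.0, 1.0 - abs(value - target) / tolerance)
    return membership

def at_most(limit, slack):
    """Membership 1 up to limit, falling linearly to 0 at limit + slack."""
    def membership(value):
        if value <= limit:
            return 1.0
        return max(0.0, 1.0 - (value - limit) / slack)
    return membership

def rank(products, criteria):
    """Rank products by the degree to which they satisfy all criteria.

    The degrees are combined with min, a common choice of fuzzy AND;
    products with degree 0 are dropped.
    """
    scored = []
    for product in products:
        degree = min(m(product[attr]) for attr, m in criteria.items())
        if degree > 0:
            scored.append((degree, product["name"]))
    return sorted(scored, reverse=True)

# Hypothetical catalogue entries and a query such as
# "costs at most about 100 and weighs about 1 kg".
products = [
    {"name": "A", "price": 100, "weight": 2.0},
    {"name": "B", "price": 120, "weight": 1.0},
    {"name": "C", "price": 200, "weight": 1.0},
]
criteria = {"price": at_most(100, 50), "weight": about(1.0, 2.0)}
ranked = rank(products, criteria)
```

Unlike a crisp filter, product B is still returned even though it exceeds
the price limit slightly, and it ranks above A because it matches the
query better overall.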

Abstract. This paper is concerned with interpreting and tracking
scholarly documents in distributed research communities. We argue that
current approaches to document description and current technological
infrastructures, particularly on the World Wide Web, provide little
support for these tasks. We describe the design of a digital library
server that will allow authors to submit a summary of the contributions
they claim their document makes, and of its relationship to the
literature. We describe a knowledge-based Web environment intended to
foster the emergence of a community-constructed semantic hypertext, and
the services it could offer for interpreting an idea or a document in
the context of its literature. The discussion examines in detail how
this approach addresses the usability problems of knowledge structuring
environments.

Abstract. I show that the World Wide Web is a “small world” network, in
that the largest set of linked sites contains highly connected subsets
of sites while the distance between any two sites remains small. I
demonstrate the advantages of a search engine that makes use of the fact
that the pages matching a particular search query can form such a
network. In another application, the search engine uses this property to
measure connectivity between communities on the Web.
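
The two properties named in this abstract, highly connected subsets and
short distances between sites, are commonly quantified by the clustering
coefficient and the average shortest-path length. The sketch below
computes both on a toy graph. Treating links as undirected, the example
graph and the function names are assumptions; the abstract does not say
how the measurements were made.

```python
from collections import deque
from itertools import combinations

def clustering_coefficient(graph):
    """Mean, over all nodes, of the fraction of neighbour pairs that are linked.

    Nodes with fewer than two neighbours contribute 0.
    """
    total = 0.0
    for neighbours in graph.values():
        if len(neighbours) < 2:
            continue
        pairs = list(combinations(neighbours, 2))
        linked = sum(1 for a, b in pairs if b in graph[a])
        total += linked / len(pairs)
    return total / len(graph)

def average_path_length(graph):
    """Mean shortest-path length over all reachable ordered pairs of nodes."""
    total, count = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:                      # breadth-first search from source
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        count += len(dist) - 1            # exclude the source itself
    return total / count

# Toy undirected graph: two triangles of sites joined by one link (c-d).
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
cc = clustering_coefficient(graph)
apl = average_path_length(graph)
```

A network is usually called a small world when its clustering
coefficient is much higher than that of a random graph of the same size
while its average path length stays comparably short.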

Abstract. We present the Smart Object, Dumb Archive (SODA) model for
digital libraries (DLs). The SODA model transfers to the archived
objects themselves functions traditionally associated with archives. We
exploit this shift of responsibility to facilitate other DL goals, such
as interoperability, object intelligence and mobility, and
heterogeneity. Objects in a SODA library negotiate the presentation of
their content and handle their own terms and conditions. In this article
we present implementations of our smart objects, of containers, and of
the dumb archive (DA). We discuss the status of the containers and the
DA and how they are used in a variety of digital library projects.